Abstract


Adaptive filtering techniques are necessary when a specific signal
output is desired but the coefficients of the filter that produces it
cannot be determined at the outset. Sometimes this is because of
changing line or transmission conditions. An adaptive filter is one
whose coefficients are updated by an adaptive algorithm to optimize
the filter response with respect to a desired performance criterion.


Two devices, the TMS320C25 and TMS320C30, combine power, high
speed, flexibility, and an architecture optimized for adaptive
signal processing.


This book discusses adaptive filter implementation as it applies
to these two processors.


The book begins with a description of the two parts of an adaptive 
filter: the filter and the adaptive algorithm. The book goes on to 
discuss: 


• The applications of adaptive filters (including adaptive
prediction, equalization, noise cancellation and echo
cancellation)


• The implementation of adaptive structures and algorithms
(including transversal structure with the LMS algorithm,
symmetric transversal structure, lattice structure, and modified
LMS algorithms)


• Implementation considerations (including dynamic range
constraint, finite precision errors, and design issues)


SPRA116 


• Software development (assembly function libraries, C function
libraries, development process and environment)


The book also contains: 


• Tables showing transversal structure, symmetric transversal
structure and lattice structure for both the TMS320C25 and
TMS320C30 processors


• Extensive references


• Multiple appendices of sample code for both TMS320C25 and
TMS320C30 processors
Introduction


A filter selects or controls the characteristics of the signal it produces by
conditioning the incoming signal. In many cases, the filter coefficients are fixed a priori
and determine the filter's characteristics and output. Often, however, a specific output is
desired, but the coefficients of the filter cannot be determined at the outset. An example
is an echo canceller; the desired output cancels the echo signal (an output result of zero
when there is no other input signal). In this case, the coefficients cannot be determined
initially since they depend on changing line or transmission conditions. For applications
such as this, it is necessary to rely on adaptive filtering techniques.


An adaptive filter is a filter containing coefficients that are updated by an adaptive
algorithm to optimize the filter's response to a desired performance criterion. In general,
adaptive filters consist of two distinct parts: a filter, whose structure is designed to
perform a desired processing function; and an adaptive algorithm, for adjusting the
coefficients of that filter to improve its performance, as illustrated in Figure 1. The incoming
signal, x(n), is weighted in a digital filter to produce an output, y(n). The adaptive algorithm
adjusts the weights in the filter to minimize the error, e(n), between the filter output, y(n),
and the desired response of the filter, d(n). Because of their robust performance in
unknown and time-varying environments, adaptive filters are widely used in fields
ranging from telecommunications to control.


[Block diagram: input x(n), FILTER STRUCTURE, output y(n), desired response d(n), error e(n)]


Figure 1. General Form of an Adaptive Filter
Adaptive filters can be used in various applications with different input and output
configurations. In many applications requiring real-time operation, such as adaptive
prediction, channel equalization, echo cancellation, and noise cancellation, an adaptive filter
implementation based on a programmable digital signal processor (DSP) has many
advantages over other approaches such as a hard-wired adaptive filter. Not only are power,
space, and manufacturing requirements greatly reduced, but programmability also
provides flexibility for system upgrade and software improvement.


The early research on adaptive filters was concerned with adaptive antennas [1] and 
adaptive equalization of digital transmission systems [2]. Much of the reported research
on the adaptive filter has been based on Widrow’s well-known Least Mean Square (LMS) 
algorithm, because the LMS algorithm is relatively simple to design and implement, and 
it is well-understood and well-suited for many applications. All the filter structures and 
update algorithms discussed in this application report are Finite Impulse Response (FIR) 
filter structures and LMS-type algorithms. However, for a particular application, adap- 
tive filters can be implemented in a variety of structures and adaptation algorithms [1, 
3 through 9]. These structures and algorithms generally trade increased complexity for 
improved performance. An interactive software package to evaluate the performance of 
adaptive filters has also been developed [10]. 


The complexity of an adaptive filter implementation is usually measured in terms
of its multiplication rate and storage requirement. However, the data flow and data
manipulation capabilities of a DSP are also major factors in implementing adaptive filter
systems. A parallel hardware multiplier, a pipelined architecture, and fast on-chip memory
are major features of most DSPs [11, 12] and can make filter implementation more efficient.


Two such devices, the TMS320C25 and TMS320C30 from Texas Instruments [13, 14],
have been chosen as the processors for fixed-point and floating-point arithmetic,
respectively. They combine power, high speed, flexibility, and an architecture optimized
for adaptive signal processing. The instruction execution time is 80 ns for the TMS320C25
and only 60 ns for the TMS320C30. Most instructions execute in a single cycle, and the
architectures of both processors make it possible to execute more than one operation per
instruction. For example, in one instruction, the TMS320C25 processor can generate an
instruction address and fetch that instruction, decode the instruction, perform one or two
data moves (if the second data is from program memory), update one address pointer, and
perform one or two computations (multiplication and accumulation). These processors are
designed for real-time tasks in telecommunications, speech processing, image processing,
and high-speed control.


To direct the present research toward realistic real-time applications, three adaptive 
structures were implemented: 


1. Transversal 
2. Symmetric transversal 
3. Lattice 


Each structure utilizes five different update algorithms: 


1. LMS
2. Normalized LMS
3. Leaky LMS
4. Sign-error LMS
5. Sign-sign LMS


Each structure with its adaptation algorithms is implemented using the TMS320C25 
with fixed-point arithmetic and the TMS320C30 with floating-point arithmetic. The pro- 
cessor assembly code is included in the Appendix for each implementation. The assembly 
code for each structure and adaptation strategy can be readily modified by the reader to 
fit his/her applications and could be incorporated into a C function library as callable 
routines. 


In this application report, the applications of adaptive filters, such as adaptive predic- 
tion, adaptive equalization, adaptive echo cancellation, and adaptive noise cancellation 
are presented first. Next, the implementation of the three filter structures and five adap- 
tive algorithms with the TMS320C25 and TMS320C30 is described. This is followed by 
the practical considerations on the implementation of these adaptive filters. The remainder 
of the application report covers coding options, such as the routine libraries that support 
both assembly and C languages. 


Applications of Adaptive Filters 


The most important feature of an adaptive filter is the ability to operate effectively 
in an unknown environment and track time-varying characteristics of the input signal. The 
adaptive filter has been successfully applied to communications, radar, sonar, control, 
and image processing. Figure 1 illustrates a general form of an adaptive filter with input 
signals, x(n) and d(n), output signal, y(n), and error signal, e(n), which is the difference 
between the desired signal, d(n), and output signal, y(n). The adaptive filter can be used 
in different applications with different input/output configurations. In this section we briefly 
discuss several potential applications for the adaptive filters [15]. 


Adaptive Prediction 


Adaptive prediction [16 through 18] is illustrated in Figure 2. In the general ap- 
plication of adaptive prediction, the signals are x(n) — delayed version of original signal, 
d(n) — original input signal, y(n) — predicted signal, and e(n) — prediction error or 
residual. 
[Block diagram: ADAPTIVE FILTER with error output e(n)]


Figure 2. Block Diagram of an Adaptive Predictor


A major application of the adaptive prediction is the waveform coding of a speech 
signal. The adaptive filter is designed to exploit the correlation between adjacent samples 
of the speech signal so that the prediction error is much smaller than the input signal on 
the average. This prediction error signal is quantized and sent to the receiver in order 
to reduce the number of bits required for the transmission. This type of waveform coding 
is called Adaptive Differential Pulse-Code Modulation (ADPCM) [17] and provides data 
rate compression of the speech at 32 kb/s with toll quality. More recently, in certain on- 
line applications, time recursive modeling algorithms have been proposed to facilitate speech 
modeling and analysis. 


The coefficients of the adaptive predictor can be used as the autoregressive (AR) 
parameters of the nonstationary model. The equation of the AR process is 


\[ u(n) = a_1 u(n-1) + a_2 u(n-2) + \cdots + a_m u(n-m) + v(n) \]


where a_1, a_2, ..., a_m are the AR parameters. Thus, the present value of the process u(n)
equals a finite linear combination of past values of the process plus an error term v(n). 
This adaptive AR model provides a practical means to measure the instantaneous frequen- 
cy of input signal. The adaptive predictor can also be used to detect and enhance a narrow 
band signal embedded in broad band noise. This Adaptive Line Enhancer (ALE) provides 
at its output y(n) a sinusoid with an enhanced signal-to-noise ratio, while the sinusoidal 
components are reduced at the error output e(n). 


Adaptive Equalization 


Figure 3 shows another model known as adaptive equalization [2, 9, 15]. The signals 
in the adaptive equalization model are defined as x(n) — received signal (filtered version 
of transmitted signal) plus channel noise, d(n) — detected data signal (data mode) or pseudo 
random number (training mode), y(n) — equalized signal used to detect received data, 
and e(n) — residual intersymbol interference plus noise. 


[Block diagram: ADAPTIVE filter, RANDOM source, switch between DATA MODE and TRAINING MODE]


Figure 3. Block Diagram of an Adaptive Equalizer


The use of adaptive equalization to eliminate the amplitude and phase distortion in- 
troduced by the communication channel was one of the first applications of adaptive filtering 
in telecommunications [19]. The effect of each symbol transmitted over a time-dispersive 
channel extends beyond the time interval used to represent that symbol, resulting in an 
overlay of received symbols. Since most channels are time-varying and unknown in ad- 
vance, the adaptive channel equalizer is designed to deal with this intersymbol interference 
and is widely used for bandwidth-efficient transmission over telephone and radio channels. 


Adaptive Echo Cancellation 


Another application, known as adaptive echo cancellation [20, 21] is shown in Figure 
4. In this application, the signals are identified as x(n) — far-end signal, d(n) — echo 
of far-end signal plus near-end signal, y(n) — estimated echo of far-end signal, and e(n) 
— near-end signal plus residual echo. 
Figure 4. Block Diagram of an Echo Canceller


The adaptive echo cancellers are used in practical applications of cancelling echoes 
for long-distance telephone voice communication, full-duplex voiceband data modems, 
and high-performance audio-conferencing systems. To overcome the echo problem, echo 
cancellers are installed at both ends of the network. The cancellation is achieved by 
estimating the echo and subtracting it from the return signal. 


Adaptive Noise Cancellation 


One of the simplest and most effective adaptive signal processing techniques is adap- 
tive noise cancelling [1, 22]. As shown in Figure 5, the primary input d(n) contains both 
signal and noise, where x(n) is the noise reference input. An adaptive filter is used to 
estimate the noise in d(n) and the noise estimate y(n) is then subtracted from the primary 
channel. The noise cancellation output is then the error signal e(n). 


The applications of noise cancellation include the cancellation of various forms of
interference in electrocardiography, noise in speech signals, noise in fighter cockpit
environments, antenna sidelobe interference, and the elimination of 60-Hz hum. In the
majority of these noise cancellation applications, the LMS algorithm has been utilized.


[Block diagram: SIGNAL SOURCE, ADAPTIVE FILTER, error output e(n)]


Figure 5. General Form of a Noise Canceller


Application Summary 


The above list of applications is not exhaustive and is limited primarily to applica- 
tions within the field of telecommunications. Adaptive filtering has been used extensively 
in the context of many other fields including, but not limited to, instantaneous frequency 
tracking, intrusion detection, acoustic Doppler extraction, on-line system identification, 
geophysical signal processing, biomedical signal processing, the elimination of radar clutter, 
beamforming, sonar processing, active sound cancellation, and adaptive control. 


Implementation of Adaptive Structures and Algorithms 


Several types of filter structures can be used in the design of adaptive filters,
such as Infinite Impulse Response (IIR) or Finite Impulse Response (FIR). An adaptive
IIR filter [1, 5], with poles as well as zeros, can offer the same filter characteristics
as an FIR filter with lower filter complexity. However, the major problem with the
adaptive IIR filter is possible instability if the poles move outside the unit circle
during the adaptive process. In this application report, only FIR structures are
implemented, to guarantee filter stability.


An adaptive FIR filter can be realized using transversal, symmetric transversal, and 
lattice structures. In this section, the adaptive transversal filter with the LMS algorithm 
is introduced and implemented first to provide a working knowledge of adaptive filters. 


Transversal Structure with LMS Algorithm 
Transversal Structure Filter 


The most common implementation of the adaptive filter is the transversal structure 
(tapped delay line) illustrated in Figure 6. The filter output signal y(n) is 


\[ y(n) = \mathbf{w}^T(n)\,\mathbf{x}(n) = \sum_{i=0}^{N-1} w_i(n)\, x(n-i) \tag{1} \]


where x(n) = [x(n) x(n-1) ... x(n-N+1)]^T is the input vector, w(n) = [w_0(n) w_1(n) ...
w_{N-1}(n)]^T is the weight vector, T denotes transpose, n is the time index, and N is the
order of the filter. Equation (1) has the form of a finite impulse response filter, that is,
the convolution (inner product) of the two vectors x(n) and w(n). The implementation of
Equation (1) is illustrated using the following C program:


y[n] = 0.;
for (i = 0; i < N; i++) {
    y[n] += wn[i]*xn[i];
}


where wn[i] denotes w_i(n) and xn[i] represents x(n-i).
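The fragment above can be wrapped in a small function. The following is a minimal sketch only: the function name fir_output and the use of double precision are assumptions made for illustration, not part of the original listings.

```c
/* Inner product of Equation (1): y(n) = sum over i of w_i(n) * x(n-i).
 * wn[i] holds w_i(n) and xn[i] holds x(n-i), for i = 0 .. order-1.
 * The name fir_output is hypothetical. */
static double fir_output(const double wn[], const double xn[], int order)
{
    double y = 0.0;
    int i;

    for (i = 0; i < order; i++) {
        y += wn[i] * xn[i];
    }
    return y;
}
```

Calling fir_output(wn, xn, N) once per input sample, after the newest sample has been placed in xn[0], gives the filter output y(n).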
Figure 6. Transversal Filter Structure


TMS320C25 Implementation


The architecture of the TMS320C25 [13] is optimized to implement the FIR filter. After
execution of the CNFP (Configure Block B0 as Program Memory) instruction, the filter
coefficients w_i(n) from RAM block B0 (via the program bus) and data x(n-i) from RAM
block B1 (via the data bus) are available simultaneously for the parallel multiplier (see Figure 7).


[Block diagram: auxiliary register ARn, MULTIPLIER]


Figure 7. TMS320C25 Arithmetic Unit (after execution of the CNFP instruction)


The MACD instruction enables complete multiply/accumulate, data move, and pointer
update operations to be completed in a single instruction cycle (80 ns) if filter coefficients
are stored in on-chip RAM or ROM or in off-chip program memory with zero wait states.
Since the adaptive weights w_i(n) need to be updated in every iteration, the filter
coefficients must be stored in RAM. The implementation of the inner product in Equation (1)
can be made even more efficient with a repeat instruction, RPTK. An N-weight
transversal filter can be implemented as follows [23]:


        LARP  ARn
        LRLK  ARn,LASTAP
        RPTK  N-1
        MACD  COEFFP,*-                                   (A)


where ARn is an auxiliary address register that points to x(n-N+1), and the Prefetch
Counter (PFC) points to the last weight w_{N-1}(n), indicated by COEFFP. When the MACD
instruction is repeated, the coefficient address is transferred to the PFC and is incremented
by one during its operation. Therefore, the components of the weight vector w(n) are stored
in B0 as


Low Address     w_{N-1}(n)    <-- PFC
                w_{N-2}(n)
                    .
                    .
High Address    w_0(n)


The MACD in repeat mode will also copy the data pointed to by ARn to the next higher
on-chip RAM location. The buffer memory of the transversal filter is therefore stored as


Low Address     x(n)
                x(n-1)
                    .
                    .
High Address    x(n-N+1)      <-- ARn


In general, roundoff noise occurs after each multiplication. However, the 
TMS320C25 has a 16 x 16-bit multiplier and a 32-bit accumulator, so there is no roundoff 
during the summing of a set of product terms in Program (A). All multiplication products 
are represented in full precision, and rounding is performed after they are summed. Thus 
y(n) is obtained from the accumulator with only one roundoff, which minimizes the round- 
off noise in the output y(n). Since both the tapped delay line and the adaptive weights 
are stored in data RAM to achieve the fastest throughput, the highest transversal filter 
order for efficient implementation on the TMS320C25 is 256. However, if necessary, 
higher order filters can be implemented by using external data RAM. 
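The single-roundoff accumulation described above can be modeled in C. This is a sketch under stated assumptions rather than a model of the exact TMS320C25 datapath: it assumes Q15 operands, assumes the sum of products fits in 32 bits without overflow, assumes the compiler performs an arithmetic right shift on negative values, and the name fir_q15 is hypothetical.

```c
#include <stdint.h>

/* Q15 inner product with a 32-bit accumulator.
 * Each 16 x 16-bit product is kept in full precision (Q30); the sum is
 * rounded only once, when it is converted back to Q15.
 * Assumptions: no accumulator overflow, arithmetic right shift. */
static int16_t fir_q15(const int16_t w[], const int16_t x[], int order)
{
    int32_t acc = 0;
    int i;

    for (i = 0; i < order; i++) {
        acc += (int32_t)w[i] * (int32_t)x[i];   /* full-precision product */
    }
    acc += (int32_t)1 << 14;                     /* add half an LSB of Q15 */
    return (int16_t)(acc >> 15);                 /* single roundoff */
}
```

Rounding each product separately would instead inject one roundoff error per tap, which is the case the paragraph above contrasts against.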


TMS320C30 Implementation 


The architecture of the TMS320C30 [14] is quite different from that of TI's second-generation
processors. Instead of using program/data memory, it provides two data address buses
for data memory manipulation. This feature allows two data memory addresses
to be generated at the same time. Hence, two parallel data stores, two parallel loads, or
one data store with one data load can be done simultaneously. Such capabilities make
programming much easier and more flexible. Since the hardware multiplier and arithmetic
logic unit (ALU) of the TMS320C30 are separate, with proper operand arrangement the
processor can do one multiplication and one addition or subtraction at the same time.
With these two features combined, the TMS320C30 can execute several other parallel
instructions. These parallel instructions can be found in Section 11 of the Third-Generation
TMS320 User's Guide [14]. Used with the repeat-single instruction RPTS, the inner product
in Equation (1) can be implemented as follows:


        MPYF3  *AR0++(1)%,*AR1++(1)%,R1    ; w[0].x[0]
        RPTS   N-2                         ; Repeat N-1 times
        MPYF3  *AR0++(1)%,*AR1++(1)%,R1    ; y[] = w[].x[]
||      ADDF3  R1,R2,R2
        ADDF3  R1,R2,R2                    ; Include last product


where auxiliary registers AR0 and AR1 point to the x and w arrays. The addition in the
parallel instruction sums the previous values of R1 and R2. Therefore, R1 is initialized
with the first product prior to the repeat instruction RPTS.


Note that the implementation above does not move the data in the x array as MACD
does on the TMS320C25. For the filter delay taps, the TMS320C30 uses a circular buffer
to implement the delay line. This method reserves a fixed amount of memory for the buffer
and uses a pointer to indicate the beginning of the buffer. Instead of moving data to the
next memory location, the pointer is updated to point to the previous memory location.
From the new beginning of the buffer, this has the same effect as the tapped delay line.
When the value of the pointer passes the end of the buffer, it wraps around
to the other end, as if the two ends of the buffer were joined together
like a necklace. Thus, new data is written into the circular queue at the location pointed
to by AR0, replacing the oldest value. However, from an adaptive filter point of view, the
data does not have to be moved at this point yet.
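The circular-buffer delay line described above can be sketched in C as follows. The type and function names (delay_line, delay_push, delay_tap) and the buffer length are assumptions chosen for illustration; on the TMS320C30 itself the wraparound is done by the circular addressing hardware rather than by a modulo operation.

```c
#include <stddef.h>

#define DELAY_LEN 4            /* buffer length N (illustrative value) */

typedef struct {
    double buf[DELAY_LEN];
    size_t head;               /* index of the newest sample x(n) */
} delay_line;

/* Insert a new sample: step the pointer back one location (wrapping
 * around the end of the buffer) and overwrite the oldest value.
 * No other data is moved. */
static void delay_push(delay_line *d, double x_new)
{
    d->head = (d->head + DELAY_LEN - 1) % DELAY_LEN;
    d->buf[d->head] = x_new;
}

/* Read x(n-i) for i = 0 .. DELAY_LEN-1, counting from the newest sample. */
static double delay_tap(const delay_line *d, size_t i)
{
    return d->buf[(d->head + i) % DELAY_LEN];
}
```

After each delay_push, delay_tap(d, i) returns x(n-i), so the filter sees the same ordering as a shifted tapped delay line even though only one memory location was written.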


The TMS320C30 has a 32-bit floating-point multiplier, and the result from the multiplier
is accumulated in a 40-bit extended-precision register. If the input from the A/D
converter is 16 bits or fewer, there is no roundoff noise after multiplication.
Theoretically, the TMS320C30 can implement a very high order adaptive filter.
However, for the most efficient implementation, the filter order is limited to 2K because
a TMS320C30 external data write requires at least two cycles. If the filter coefficients
are placed anywhere other than internal data RAM, the number of instruction cycles increases.


LMS Adaptation Algorithm 
The adaptation algorithm uses the error signal 


\[ e(n) = d(n) - y(n) \tag{2} \]


where d(n) is the desired signal and y(n) is the filter output. The input vector x(n) and
e(n) are used to update the adaptive filter coefficients according to a criterion that is to
be minimized. The criterion employed in this section is the mean-square error (MSE) ξ:


\[ \xi = E[e^2(n)] \tag{3} \]


where E [.] denotes the expectation operator. If y(n) from Equation (1) is substituted into 
Equation (2), then Equation (3) can be expressed as 


\[ \xi = E[d^2(n)] + \mathbf{w}^T(n)\,\mathbf{R}\,\mathbf{w}(n) - 2\,\mathbf{w}^T(n)\,\mathbf{p} \tag{4} \]


where R = E[x(n) x^T(n)] is the N x N autocorrelation matrix, which indicates the sample-
to-sample correlation within a signal, and p = E[d(n) x(n)] is the N x 1 cross-correlation
vector, which indicates the correlation between the desired signal d(n) and the input signal
vector x(n).

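The optimum solution w* = [w_0* w_1* ... w_{N-1}*]^T, which minimizes the MSE, is
derived by solving the equation


\[ \frac{\partial \xi}{\partial \mathbf{w}(n)} = 0 \tag{5} \]


This leads to the normal equation


\[ \mathbf{R}\,\mathbf{w}^* = \mathbf{p} \tag{6} \]


If the R matrix has full rank (i.e., R^{-1} exists), the optimum weights are obtained by


\[ \mathbf{w}^* = \mathbf{R}^{-1}\,\mathbf{p} \tag{7} \]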
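The step from the MSE expression in Equation (4) to the normal equation in Equation (6) can be written out explicitly. The short derivation below is a sketch that assumes R is symmetric, which follows from its definition as E[x(n) x^T(n)], and uses the standard vector-derivative identities for quadratic and linear forms.

```latex
\begin{align*}
\frac{\partial \xi}{\partial \mathbf{w}(n)}
  &= \frac{\partial}{\partial \mathbf{w}(n)}
     \left( E[d^2(n)] + \mathbf{w}^T(n)\,\mathbf{R}\,\mathbf{w}(n)
            - 2\,\mathbf{w}^T(n)\,\mathbf{p} \right) \\
  &= 2\,\mathbf{R}\,\mathbf{w}(n) - 2\,\mathbf{p}
\end{align*}
% Setting this gradient to zero at w(n) = w^* gives R w^* = p, Equation (6).
```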


In Linear Predictive Coding (LPC) of a speech signal, the input speech is divided 
into short segments, the quantities of R and p are estimated, and the optimal weights cor- 
responding to each segment are computed. This procedure is called a block-by-block data- 
adaptive algorithm [24]. 


The widely used LMS algorithm is an alternative that adapts the weights
on a sample-by-sample basis. Since this method avoids the complicated computation
of R^{-1} and p, it is a practical method for finding close approximate solutions
to Equation (7) in real time. The LMS algorithm is a steepest descent method in which
the next weight vector w(n+1) is obtained from the current one by a change proportional
to the negative gradient of the mean-square-error performance surface in Equation (4)


\[ \mathbf{w}(n+1) = \mathbf{w}(n) - \mu\, \nabla(n) \tag{8} \]


where μ is the adaptation step size that controls the stability and the convergence rate.
For the LMS algorithm, the gradient at the nth iteration, ∇(n), is estimated by taking the
squared error e^2(n) as an estimate of the MSE in Equation (3). Thus, the expression for
the gradient estimate can be simplified to


\[ \nabla(n) = \frac{\partial [e^2(n)]}{\partial \mathbf{w}(n)} = -2\, e(n)\, \mathbf{x}(n) \tag{9} \]


Substitution of this instantaneous gradient estimate into Equation (8) yields the 
Widrow-Hoff LMS algorithm 


\[ \mathbf{w}(n+1) = \mathbf{w}(n) + 2\mu\, e(n)\, \mathbf{x}(n) \tag{10} \]


where 2μ in Equation (10) is usually replaced by μ in practical implementations.


Starting with an arbitrary initial weight vector w(0), the weight vector w(n) will
converge to its optimal solution w*, provided μ is selected such that [1]


\[ 0 < \mu < \frac{1}{\lambda_{\max}} \tag{11} \]


where λ_max is the largest eigenvalue of the matrix R. λ_max can be bounded by


\[ \lambda_{\max} \le \mathrm{Tr}[\mathbf{R}] = \sum_{i=0}^{N-1} r(0) = N\, r(0) \tag{12} \]


where Tr[.] denotes the trace of a matrix and r(0) = E[x^2(n)] is the average input power.
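Equations (11) and (12) together suggest a practical way to pick a safe step size from a block of input samples. The sketch below estimates r(0) as the mean of the squared samples and returns 1/(N r(0)); the function name lms_step_bound and the block-average power estimate are assumptions made for illustration.

```c
/* Conservative upper bound on the step size from Equations (11)-(12):
 * lambda_max <= N * r(0), so any step size below 1 / (N * r(0)) also
 * satisfies Equation (11).  r(0) is estimated here as the mean of the
 * squared input samples.  Returns 0.0 when no estimate is possible
 * (empty block, non-positive order, or zero input power). */
static double lms_step_bound(const double x[], int num_samples, int order)
{
    double power = 0.0;
    int k;

    if (num_samples <= 0 || order <= 0) {
        return 0.0;
    }
    for (k = 0; k < num_samples; k++) {
        power += x[k] * x[k];
    }
    power /= (double)num_samples;
    if (power <= 0.0) {
        return 0.0;
    }
    return 1.0 / ((double)order * power);
}
```

Because the trace bound is usually loose, a step size chosen well below this value trades some convergence speed for a comfortable stability margin.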


For adaptive signal processing applications, the most important practical considera- 
tion is the speed of convergence, which determines the ability of the filter to track nonsta- 
tionary signals. Generally speaking, weight vector convergence is attained only when the 
slowest weight has converged. The time constant of the slowest mode is [1] 


\[ \tau = \frac{1}{\mu\, \lambda_{\min}} \tag{13} \]


This indicates that the time constant for weight convergence is inversely
proportional to μ and also depends on the eigenvalues of the autocorrelation matrix of the input.
With disparate eigenvalues, i.e., λ_max >> λ_min, the settling time is limited by the
slowest mode, λ_min. Figure 8 shows the relaxation of the mean square error from its
initial value ξ_0 toward the optimal value ξ_min.


Adaptation based on a gradient estimate results in noise in the weight vector and therefore
a loss in performance. This noise in the adaptive process causes the steady state weight
vector to vary randomly about the optimum weight vector. The accuracy of the weight vector
in steady state is measured by the excess mean square error (excess MSE = E[ξ - ξ_min]).
The excess MSE in the LMS algorithm [1] is


\[ \text{excess MSE} = \mu\, \mathrm{Tr}[\mathbf{R}]\, \xi_{\min} \tag{14} \]


where ξ_min is the minimum MSE in the steady state.


Equations (13) and (14) yield the basic trade-off of the LMS algorithm: to obtain
high accuracy (low excess MSE) in the steady state, a small value of μ is required, but
this will slow down the convergence rate. Further discussions of the characteristics and
properties of the LMS algorithm are presented in [1, 3 through 9]. The implementations
of the LMS algorithm with the TMS320C25 and TMS320C30 are presented next.
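The trade-off between Equations (13) and (14) can be made concrete with two one-line helpers. This is only an illustrative sketch: the function names are hypothetical, and the eigenvalue, trace, and minimum-MSE values passed in are assumed to be known or estimated elsewhere.

```c
/* Time constant of the slowest mode, Equation (13). */
static double lms_time_constant(double u, double lambda_min)
{
    return 1.0 / (u * lambda_min);
}

/* Steady-state excess MSE, Equation (14). */
static double lms_excess_mse(double u, double trace_r, double xi_min)
{
    return u * trace_r * xi_min;
}
```

Halving the step size u halves the value returned by lms_excess_mse and doubles the value returned by lms_time_constant, which is the accuracy-versus-speed trade-off stated above.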
Figure 8. Learning Curve of an Adaptive Transversal Filter and an LMS
Algorithm with Different Step Sizes 


Since μe(n) is the same for all N weight updates, the error signal e(n) is first multiplied
by μ to get μe(n). This constant can be computed once and then multiplied by each x(n-i)
to update w_i(n). An implementation of the LMS algorithm in Equation (10) is illustrated
as


uen = u*e[n];               /* u*e(n), computed once */

for (i=0; i<N; i++) {
    wn[i] += uen * xn[i];
}
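Combining Equations (1), (2), and (10) gives one complete LMS iteration per input sample. The following is a minimal floating-point sketch; the function name lms_step and its argument list are assumptions for illustration, and the factor 2 of Equation (10) is taken to be folded into u, as noted earlier.

```c
/* One LMS iteration for an N-weight transversal filter.
 * wn[i] holds w_i(n) and is updated in place to w_i(n+1);
 * xn[i] holds x(n-i); d is the desired sample d(n); u is the step size
 * (with the factor 2 of Equation (10) folded in).
 * Returns the error e(n). */
static double lms_step(double wn[], const double xn[], double d,
                       double u, int order)
{
    double y = 0.0;
    double e, uen;
    int i;

    for (i = 0; i < order; i++) {       /* Equation (1): filter output */
        y += wn[i] * xn[i];
    }
    e = d - y;                           /* Equation (2): error signal */
    uen = u * e;                         /* computed once per sample */
    for (i = 0; i < order; i++) {       /* Equation (10): weight update */
        wn[i] += uen * xn[i];
    }
    return e;
}
```

The caller is responsible for shifting the newest input sample into xn[0] before each call, for example with a shifted array or a circular buffer.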


TMS320C25 Implementation 


The TMS320C25 provides two powerful instructions (ZALR and MPYA) to per- 
form the update example in Equation (10). 


• ZALR loads a data memory value into the high-order half of the
accumulator while rounding the value by setting bit 15 of the accumulator
to one and setting bits 0-14 of the accumulator to zero. The rounding is
necessary because it reduces the roundoff noise from multiplication.


• MPYA accumulates the previous product in the P register and multiplies
the operand with the data in the T register.


Assuming that ue(n) is stored in T and the address pointer is pointing to AR3, the 
adaptation of each weight is shown in the following instruction sequence: 


        LRLK  AR1,N-1          ; Initialize loop counter
        LRLK  AR2,COEFFD       ; Point to w_{N-1}(n)
        LRLK  AR3,LASTAP+1     ; Point to x(n-N+1), since MACD in (A)
                               ; already moved elements of current
                               ; x(n) to the next higher location
        MPY   *-,AR2           ; P = ue(n) * x(n-N+1)
ADAP    ZALR  *,AR3            ; Load w_i(n) and round
        MPYA  *-,AR2           ; ACC = P + w_i(n) and P = ue(n) * x(n-i)
        SACH  *+,0,AR1         ; Store w_i(n+1)
        BANZ  ADAP,*-,AR2      ; Test loop counter; if counter not
                               ; equal to 0, decrement counter,
                               ; branch to ADAP and select AR2 as
                               ; next pointer.


For each iteration, N instruction cycles are needed to perform Equation (1), 6N
instruction cycles are needed to perform the weight updates in Equation (10), and the total
number of instruction cycles needed is 7N+28. An example of a TMS320C25 program
implementing an LMS transversal filter is presented in Appendix A1. Note that BANZ needs
three instruction cycles to execute. This overhead can be avoided by using straight-line
code, which requires 4N+33 instruction cycles [25].


TMS320C30 Implementation 


Although the TMS320C30 does not provide any specific instruction for updating adaptive
filter coefficients, it can still perform each weight update in two instructions because
of its powerful architecture. The TMS320C30 has a repeat block instruction, RPTB, which
allows a block of instructions to be repeated a number of times without any penalty for
looping. A repeat-mode bit, RM, in the status register, ST, and three registers, repeat
start address (RS), repeat end address (RE), and repeat counter (RC), control the block
repeat. When RM is set, the PC repeats the instructions between RS and RE a number
of times determined by the value of RC. The repeat modes repeat a block of
code at least once in a typical operation. The repeat counter should be loaded with one
less than the desired number of repetitions. Assuming the scaled error signal μe(n) of
Equation (10) is stored in R7, the adaptation of filter coefficients is shown as follows:


        MPYF3  *AR0++(1)%,R7,R1    ; R1 = u*e(n)*x(n)
        LDI    order-3,RC          ; Initialize repeat counter
        RPTB   LMS                 ; Do i = 0, N-3
        MPYF3  *AR0++(1)%,R7,R1    ; Compute u*e(n)*x(n-i-1)
||      ADDF3  *AR1,R1,R2          ; Compute wi(n) + u*e(n)*x(n-i)
LMS     STF    R2,*AR1++(1)%       ; Store wi(n+1)
        MPYF3  *AR0,R7,R1          ; For i = N-2
||      ADDF3  *AR1,R1,R2
        STF    R2,*AR1++(1)%       ; Store wN-2(n+1)
        ADDF3  *AR1,R1,R2          ; Include last w
        STF    R2,*AR1++(1)%       ; Store wN-1(n+1)


where auxiliary registers AR0 and AR1 point to the x and w arrays. R1 is computed before
the loop since the accumulation in the parallel instruction uses the previous value in R1.
In order to update the x array pointer to the new beginning of the data buffer for the next
iteration (i.e., perform the data move), one pass of the loop instructions has been taken
out of the loop and modified by eliminating the increment of AR0.


To perform an N—weight adaptive LMS transversal filter on TMS320C30 requires 
3N +15 instruction cycles. There are N and 2N instruction cycles to perform Equations 
(1) and (10), respectively. The TMS320C30 example program is given in Appendix A2. 
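The cycle counts quoted for the two processors translate directly into a maximum filter order for a given sampling rate. The sketch below works through that arithmetic; the function names and the 8 kHz sampling rate used in the note that follows are assumptions for illustration, and the calculation ignores I/O and any other per-sample overhead not included in the quoted formulas.

```c
/* Number of whole instruction cycles available per input sample,
 * given the sampling rate in Hz and the instruction cycle time in ns. */
static long cycles_per_sample(long sample_rate_hz, long cycle_ns)
{
    long period_ns = 1000000000L / sample_rate_hz;
    return period_ns / cycle_ns;
}

/* Largest N such that per_tap * N + overhead <= cycles.
 * Returns 0 if even the fixed overhead does not fit. */
static long max_filter_order(long cycles, long per_tap, long overhead)
{
    if (cycles <= overhead) {
        return 0;
    }
    return (cycles - overhead) / per_tap;
}
```

At an assumed 8 kHz sampling rate, the 80-ns TMS320C25 has 1562 cycles per sample, which allows N = 219 with the looped code (7N+28) and N = 382 with straight-line code (4N+33); the 60-ns TMS320C30 has 2083 cycles, which allows N = 689 (3N+15). The on-chip memory limits mentioned earlier may be reached before these cycle limits.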


The LMS algorithm considerably reduces the computational requirements by using 
a simplified mean square error estimator (an estimate of the gradient). This algorithm has 
proved useful and effective in many applications. However, it has several limitations in 
performance such as the slow initial convergence, the undesirable dependence of its con- 
vergence rate on input signal statistics, and an excess mean square error still in existence 
after convergence. 


Symmetric Transversal Structure [5] 


A transversal filter with symmetric impulse response (weight values) about the center 
weight has a linear phase response. In applications such as speech processing, linear phase 
filters are preferred since they avoid phase distortion by causing all the components in 
the filter input to be delayed by the same amount. The adaptive symmetric transversal 
structure is shown in Figure 9. 
Figure 9. Symmetric Transversal Structure (even order)


This filter is actually an FIR filter with an impulse response that is symmetric about
the center tap. The output of the filter is obtained as


\[ y(n) = \sum_{i=0}^{N/2-1} w_i(n)\,[x(n-i) + x(n-N+i+1)] \tag{15a} \]


where N is an even number. Note that, for fixed-point processors, the addition in the 
brackets may introduce overflow because the input signals x(n—1) and x(n—N+i+1) are 
in the range of —1 and 1—2~-15. This problem can be solved by shifting x(n) to the right 
one bit. The update of the weight vector is 


w(n+1) = wn) + ue(n)[x(n—1) + xm—N+t+i+1)] (15b) 


for i=0,1,...,(N/2—1), which requires N/2 multiplications and N additions. Theoretical- 
ly, this symmetric structure can also reduce computational complexity since such filters 
require only half the multiplications of the general transversal filter. However, it is true 
only for the TMS320C30 processor. When a filter is implemented on the TMS320C25, 
the transversal structure is more efficient than the symmetric transversal structure due 
to the pipeline multiplication and accumulation instruction MACD, which is optimized 
to implement convolution in Equation (1). 
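

As a rough illustration of Equations (15a) and (15b), the folded filter and its weight update can be sketched in C. This is a minimal floating-point sketch, not the appendix code: the function name `sym_lms_step`, the argument layout (x[0] holds x(n), x[N-1] holds x(n-N+1)), and the error definition e(n) = d(n) - y(n) are assumptions made for the example. 

```c
#include <assert.h>
#include <stddef.h>

/* One iteration of a symmetric (linear-phase) LMS transversal filter.
 * x[0..n-1] holds x(n), x(n-1), ..., x(n-N+1); w[0..n/2-1] holds the
 * half-length weight vector. Returns y(n) and updates w in place.
 * All names here are illustrative, not taken from the appendix code. */
static double sym_lms_step(double *w, const double *x, size_t n,
                           double d, double mu)
{
    assert(n >= 2 && n % 2 == 0);      /* Equation (15a) assumes even N */
    size_t half = n / 2;
    double y = 0.0;

    /* Equation (15a): fold the delay line, then N/2 multiplications. */
    for (size_t i = 0; i < half; i++)
        y += w[i] * (x[i] + x[n - 1 - i]);

    double e = d - y;                  /* assumed error definition */

    /* Equation (15b): each shared weight sees the folded input pair. */
    for (size_t i = 0; i < half; i++)
        w[i] += mu * e * (x[i] + x[n - 1 - i]);

    return y;
}
```

The index x[n - 1 - i] corresponds to x(n-N+i+1), so each weight multiplies the sum of the two taps that are mirror images about the center of the delay line. 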


TMS320C25 Implementation 


For the TMS320C25, in order to use the instructions MAC, ZALR, and MPYA, 
we can trade memory requirements for computation savings by defining 


$$z(n-i) = x(n-i) + x(n-N+i+1), \quad i = 0, 1, \ldots, N/2-1 \qquad (16a)$$


Now, Equation (15) can be expressed as 


$$y(n) = \sum_{i=0}^{N/2-1} w_i(n)\,z(n-i) \qquad (16b)$$


$$w_i(n+1) = w_i(n) + u\,e(n)\,z(n-i), \quad i = 0, 1, \ldots, N/2-1 \qquad (16c)$$


Equation (16a) can be implemented using the TMS320C25 as 


     LARK  AR1,N/2-1       ; Counter = N/2 - 1 
     LRLK  AR2,LAST_X      ; Point to x(n-N+1) 
     LRLK  AR3,FIRST_X     ; Point to x(n) 
     LRLK  AR4,FIRST_Z     ; Point to z(n) 
     LARP  AR3 

SYM  LAC   *+,0,AR2 
     ADD   *-,0,AR4 
     SACL  *+,0,AR1 
     BANZ  SYM,*-,AR3 


The instruction sequence that implements the LMS algorithm in Equations (1) and (10) 
can also be used to implement Equations (16b) and (16c), except that MAC is used instead of MACD 
in Program (A). N instruction cycles are needed to shift data in x(n), 3N instruction 
cycles to implement Equation (16a), N/2 for Equation (16b), and 
3N for Equation (16c). The total number of instruction cycles required to implement the 
symmetric transversal filter with the LMS algorithm is therefore 7.5N+38, where 7.5N is an 
integer because N is chosen as an even number. The 0.5N instruction cycles come from 
Equation (15a), since the symmetric transversal structure folds the filter taps into half of the 
order N (see Figure 9). The maximum filter length for the most efficient code, 256, is the 
same as for the FIR filter. The additional data memory needed can be obtained from 
the reduced data memory requirement for the weights of the symmetric transversal filter. The 
complete TMS320C25 program is given in Appendix B1. 


Note that instead of storing buffer locations x(n) contiguously, then using DMOV 
to shift data in the buffer memory (requiring N cycles) at the end of each iteration, we 
can use a circular buffer with pointers pointing to x(n) and x(n—N +1). Since pointer up- 
dating requires several instruction cycles, compared with N cycles using DMOV to up- 
date the buffer memory contents, the circular buffer technique is more efficient if N is large. 
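

The trade-off described above can be illustrated with a small C sketch of a circular delay line, in which a new sample costs one index update instead of N data moves. The type `delay_line`, the helpers `delay_push` and `delay_get`, and the length `TAPS` are hypothetical names chosen for this example. 

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 8   /* illustrative filter length N */

typedef struct {
    double buf[TAPS];
    size_t newest;      /* index of x(n) inside buf */
} delay_line;

/* Insert a new sample by moving one index instead of N data words. */
static void delay_push(delay_line *dl, double sample)
{
    /* Step the index back with wrap-around; that slot held the oldest
     * sample x(n-N+1), which is overwritten by the new x(n). */
    dl->newest = (dl->newest + TAPS - 1) % TAPS;
    dl->buf[dl->newest] = sample;
}

/* Read x(n-i) for 0 <= i < TAPS. */
static double delay_get(const delay_line *dl, size_t i)
{
    assert(i < TAPS);
    return dl->buf[(dl->newest + i) % TAPS];
}
```

The modulo operations here play the role of the pointer updates mentioned in the text; their cost is constant, whereas the data move grows with N. 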


TMS320C30 Implementation 


As mentioned above, the TMS320C30 uses a circular buffer instead of the data move 
technique. Therefore, it does not have to implement the tapped delay line separately, as the 
TMS320C25 does. Equations (1) and (16a) can be combined and implemented in the same loop. 
The advantage is that a parallel instruction reduces the number of instruction 
cycles. The implementation is shown as follows: 


        LDF    0.0,R2                      ; Clear R2 
        LDI    order/2-2,RC                ; Set up loop counter 
        RPTB   INNER                       ; Do i = 0, N/2-2 
        ADDF3  *AR4++(1)%,*AR5--(1)%,R1    ; z(i) = x(n-i) + x(n-N+i+1) 
        MPYF3  R1,*AR1++(1),R3             ; R3 = w[i] * z[i] 
||      STF    R1,*AR2++(1)                ; Store z(i) 
INNER   ADDF3  R3,R2,R2                    ; Accumulate the result for y 

        ADDF3  *AR4++(1)%,*AR5--(1)%,R1    ; For i = N/2-1 
        MPYF3  R1,*AR1--(IR0),R3 
||      STF    R1,*AR2--(IR0) 
        ADDF3  R3,R2,R2                    ; Include last product 


where AR4 and AR5 point to x[0] and x[N-1]. AR1 and AR2 point to the w and z arrays, 
respectively. IR0 contains the value N/2-1. The weight update instructions 
of the transversal filter can be reused for the symmetric transversal structure by changing the 
x array pointer to the z array pointer. Appendix B2 presents an example program. The total 
number of instruction cycles needed is 2.5N+15, which is less than that of the transversal 
structure. 


Lattice Structure [6] 


An alternative FIR filter realization is the lattice structure [26]. The discussion of the 
transversal filter with the LMS algorithm shows that the convergence rate of the transversal 
structure is restricted by the correlation of signal components, i.e., the eigenvalue spread 
λ_max/λ_min. The lattice structure is a decorrelating transform based on a family of prediction 
error filters, as illustrated in Figure 10. The recursive equations that describe the lattice 
predictor are 


$$f_0(n) = b_0(n) = x(n) \qquad (17a)$$
$$f_m(n) = f_{m-1}(n) - k_m(n)\,b_{m-1}(n-1), \quad 0 < m \le M \qquad (17b)$$
$$b_m(n) = b_{m-1}(n-1) - k_m(n)\,f_{m-1}(n), \quad 0 < m \le M \qquad (17c)$$


where f_m(n) represents the forward prediction error, b_m(n) represents the backward prediction 
error, k_m(n) is the reflection coefficient, m is the stage index, and M is the number 
of cascaded stages. The lattice structure has the advantage of being order-recursive. This 
property allows stages to be added to or deleted from the lattice without affecting the existing 
stages. 


Figure 10. Lattice Structure 


To implement the lattice filter for processing actual data, the reflection coefficients 
k_m(n) are required. These coefficients can be computed from estimates of the 
autocorrelation coefficients using Durbin's algorithm. However, it is more efficient 
to estimate the reflection coefficients directly from the data and update them 
on a sample-by-sample basis, for example with the LMS algorithm [6]. The reflection coefficient 
k_m(n+1) can be recursively computed [7]: 


$$k_m(n+1) = k_m(n) + u\,[f_m(n)\,b_{m-1}(n-1) + b_m(n)\,f_{m-1}(n)], \quad 0 < m \le M \qquad (18)$$


For applications such as noise cancellation, channel equalization, line enhancement, 
etc., the joint-process estimation [3] illustrated in Figure 11 is required. This device per- 
forms two optimum estimations: the lattice predictor and the multiple regression filter. 
The following equations define the implementation of the regression filter 


$$e_0(n) = d(n) - b_0(n)\,g_0(n) \qquad (19a)$$
$$e_m(n) = e_{m-1}(n) - b_m(n)\,g_m(n), \quad 0 < m \le M \qquad (19b)$$
$$g_m(n+1) = g_m(n) + u\,e_m(n)\,b_m(n), \quad 0 \le m \le M \qquad (20)$$


where the LMS algorithm is used to update the coefficients of the regression filter. For 
noise cancellation applications, e_M(n) corresponds to the output e(n) in Figure 5. For 
applications such as the adaptive line enhancer and channel equalizer, the filter output y(n) is 
obtained as 


$$y(n) = \sum_{m=0}^{M} g_m(n)\,b_m(n) \qquad (21)$$


Figure 11. Lattice Structure with Joint Process Estimation 


TMS320C25/TMS320C30 Implementation 


Five memory locations, f_m(n), b_m(n), b_m(n-1), k_m(n), and g_m(n), are 
required for each stage. The on-chip data RAM is limited to 544 words on the 
TMS320C25 and 2K words on the TMS320C30. A maximum of 102 stages can therefore 
be implemented on a single TMS320C25 for the highest throughput. Here, another advantage 
of the TMS320C30 architecture is shown. Since the operands of the mathematical 
operations can be either memory or registers on the TMS320C30, and there is no need 
to preserve the values of the f_m array for the next iteration (refer to Equations (17) and (18)), 
the f_m array can be replaced by an extended precision register. Thus, for the most efficient 
code, the stage limit of the lattice structure on the TMS320C30 is 512, or one-fourth 
of the 2K on-chip RAM. 


Lattice structures have superior convergence properties relative to transversal struc- 
tures and good stability properties; e.g., low sensitivity to coefficient quantization, low 
roundoff noise, and the ability to check stability by inspection. The disadvantages of lat- 
tice filter algorithms are that they are numerically complex and require mathematical 
sophistication to thoroughly understand their derivations. Furthermore, as shown in Ap- 
pendixes C1 and C2, lattice structures cannot take advantage of the TMS320C25 and 
TMS320C30’s pipeline architecture to achieve high throughput. The total number of in- 
struction cycles needed is 33M+32 for TMS320C25 and 14M+4 for TMS320C30. 
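

To make the order of operations in the lattice recursions concrete, the following C sketch processes one input sample through the predictor of Equations (17) and (18) and the regression filter of Equations (19) and (20). It is only an illustration under stated assumptions: the names `lattice_state` and `lattice_step` and the stage count `STAGES` are hypothetical, a single step size is used for both updates, and Equation (19b) is taken in the standard joint-process form in which stage m subtracts b_m(n) g_m(n). 

```c
#include <assert.h>
#include <stddef.h>

#define STAGES 4   /* M, an illustrative value */

typedef struct {
    double k[STAGES + 1];       /* k_1..k_M; k[0] is unused */
    double g[STAGES + 1];       /* g_0..g_M */
    double b_prev[STAGES + 1];  /* b_m(n-1) from the previous sample */
} lattice_state;

/* Process one sample x(n) with desired signal d(n); returns e_M(n). */
static double lattice_step(lattice_state *s, double x, double d, double mu)
{
    assert(s != NULL);
    double b[STAGES + 1];
    double f = x;                       /* f_0(n), Equation (17a) */
    b[0] = x;                           /* b_0(n), Equation (17a) */

    double e = d - b[0] * s->g[0];      /* Equation (19a) */
    s->g[0] += mu * e * b[0];           /* Equation (20), m = 0 */

    for (size_t m = 1; m <= STAGES; m++) {
        double b_del = s->b_prev[m - 1];             /* b_{m-1}(n-1) */
        double f_m = f - s->k[m] * b_del;            /* Equation (17b) */
        b[m] = b_del - s->k[m] * f;                  /* Equation (17c) */
        s->k[m] += mu * (f_m * b_del + b[m] * f);    /* Equation (18) */
        e -= b[m] * s->g[m];                         /* Equation (19b) */
        s->g[m] += mu * e * b[m];                    /* Equation (20) */
        f = f_m;                                     /* feeds stage m+1 */
    }

    /* Save b_m(n) so that it becomes b_m(n-1) for the next sample. */
    for (size_t m = 0; m <= STAGES; m++)
        s->b_prev[m] = b[m];
    return e;
}
```

Note that f is held in a single local variable rather than an array, which mirrors the remark above that the f_m values need not be preserved between iterations. 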


Modified LMS Algorithms [5] 


The LMS algorithm described in previous sections is the most widely used algorithm 
in practical applications today. In this section, a set of LMS-type algorithms (all direct 
variants of the LMS algorithm) are presented and implemented. The motivation for each 
is some practical consideration, such as faster convergence, simplicity in implementation, 
or robustness in operation. The description of these algorithms is based on the transversal 
structure. However, these algorithms can be applied to the symmetric transversal struc- 
ture and the lattice structure as well. 


Normalized LMS Algorithm 


The stability, convergence time, and fluctuation of the adaptation process are governed 
by the step size u and the input power to the adaptive filter. In some practical applications, 
you may need an automatic gain control (AGC) on the input to the adaptive filter. 
The normalized LMS algorithm is one important technique used to improve the speed of 
convergence while keeping the steady-state performance independent 
of the input signal power. This algorithm uses a variable convergence factor u(n), 
that is, a step size that is a function of the time index, 


$$u(n) = a / \mathrm{var}(n) \qquad (22)$$


and 

$$w(n+1) = w(n) + u(n)\,e(n)\,x(n) \qquad (23)$$


where a is a convergence parameter, and var(n) is an estimate of the input average power 
at time n using the recursive equation 


$$\mathrm{var}(n) = (1 - b)\,\mathrm{var}(n-1) + b\,x^2(n) \qquad (24)$$


where 0 < b << 1 is a smoothing parameter. In practice, a is chosen equal to b. 


For fixed-point processors, there is a way to reduce the computation of the power estimate. 
Since b in Equation (24) does not have to be an exact number, it is computationally 
convenient to make b a power of 2. If b = 2^{-m}, the multiplication by b can be implemented 
by shifting right m bits. Therefore, var(n) in Equation (24) is computed by 


$$\mathrm{var}(n) = \mathrm{var}(n-1) - b\,\mathrm{var}(n-1) + b\,x^2(n) = \mathrm{var}(n-1) - \mathrm{var}(n-1)\,2^{-m} + x^2(n)\,2^{-m}$$


Assume that the variance var(n) of the input signal is stored in the data memory location 
VAR and that its initial value is 0.99997 (= 1 - 2^{-15}). The implementation of this equation 
using TMS320C25 assembly code is 


    LARP  AR3 
    LRLK  AR3,FRSTAP     ; Point to input signal x 
    SQRA  *              ; Square input signal 
    SPH   ERRF 
    ZALH  VAR            ; ACC = var(n-1) 
    SUB   VAR,SHIFT      ; ACC = (1-b) var(n-1) 
    ADD   ERRF,SHIFT     ; ACC = (1-b) var(n-1) + b x2(n) 
    SACH  VAR            ; Store var(n) 
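

The shift-based form of Equation (24) can also be sketched in portable C. The sketch below assumes Q15 data (a 16-bit word representing values in [-1, 1)) and keeps the running estimate in the high half of a 32-bit accumulator, loosely following the assembly sequence; the function name `power_estimate_q15` and the choice to form x^2 directly in Q15 are assumptions of this example, not details of the appendix code. 

```c
#include <assert.h>
#include <stdint.h>

/* var(n) = var(n-1) - var(n-1)*2^-m + x^2(n)*2^-m in Q15, with b = 2^-m.
 * var_prev must be nonnegative, as a power estimate always is. */
static int16_t power_estimate_q15(int16_t var_prev, int16_t x, unsigned m)
{
    assert(m >= 1 && m <= 15);
    assert(var_prev >= 0);

    int32_t x2 = ((int32_t)x * x) >> 15;            /* x^2 in Q15, >= 0 */
    int32_t acc = (int32_t)var_prev * 65536;        /* var in the high word */
    acc -= (int32_t)var_prev * (1 << (16 - m));     /* minus var * 2^-m */
    acc += x2 * (1 << (16 - m));                    /* plus x^2 * 2^-m */
    return (int16_t)(acc >> 16);                    /* keep the high word */
}
```

A left shift by 16 - m into the accumulator followed by storing the high word is equivalent to a right shift by m, which is why the shift count is written as 16 - m. 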


The normalized LMS algorithm can be implemented as 


    var = b1 * var + b * xn[0] * xn[0]; 
    unen = e[n] * a / var; 
    for (i = 0; i < N; i++) 
        wn[i] += unen * xn[i]; 


where b1 = (1-b), xn[0] = x(n), and unen = u(n)*e(n). This normalization 
reduces the dependency of convergence speed on input signal power at the cost of increased 
computational complexity, especially the division in Equation (22). Algorithms 
for implementing fixed-point and floating-point division on the TMS320C25 and 
TMS320C30 can be found in the user's guide for each device [13, 14]. Since the power 
of the input signal is always positive, that code can be simplified to save computation time. 


Since the power estimation in Equation (24) and the step size normalization in Equation (22) 
are performed once for each sample x(n), the increase in computation can be ignored when 
N is large. As shown in Appendixes D1 and D2, the total number of instruction cycles 
needed for the normalized LMS algorithm (7N+57 for the TMS320C25 and 3N+47 for 
the TMS320C30) is only slightly higher than for the LMS algorithm (7N+34 and 3N+15) 
when N is large. 
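

For reference, one complete iteration of the normalized LMS algorithm, combining the filter output with Equations (22) through (24), might look as follows in C. This is a sketch under assumptions: the structure `nlms_state`, the function `nlms_step`, and the convention that x[0] holds x(n) are illustrative and do not come from Appendixes D1 or D2. 

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double var;   /* running power estimate var(n), must start > 0 */
    double a;     /* convergence parameter */
    double b;     /* smoothing parameter, 0 < b << 1 */
} nlms_state;

/* One normalized LMS iteration; x[0] = x(n), ..., x[n-1] = x(n-N+1).
 * Returns the error e(n) = d(n) - y(n) and updates w and s->var. */
static double nlms_step(nlms_state *s, double *w, const double *x,
                        size_t n, double d)
{
    assert(s->var > 0.0 && s->b > 0.0 && s->b < 1.0);

    double y = 0.0;
    for (size_t i = 0; i < n; i++)     /* transversal filter output */
        y += w[i] * x[i];
    double e = d - y;

    /* Equation (24): recursive power estimate of the input. */
    s->var = (1.0 - s->b) * s->var + s->b * x[0] * x[0];

    /* Equation (22) folded with e(n): one division per sample. */
    double unen = e * s->a / s->var;

    for (size_t i = 0; i < n; i++)     /* Equation (23) */
        w[i] += unen * x[i];
    return e;
}
```

Because the division is performed once per sample and not once per weight, its cost does not grow with N, which matches the cycle counts quoted above. 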


Sign LMS Algorithms 


The LMS algorithm requires 2N multiplications and additions for each iteration. 
This is much lower than the requirements of many more complicated adaptive 
algorithms, such as Kalman and recursive least squares (RLS) [3]. However, there are 
three simplified versions of the LMS algorithm (sign-error LMS, sign-data LMS, and 
sign-sign LMS) that reduce the number of multiplications required and extend the real-time 
bandwidth for some applications [5, 27]. 


First, the sign-error LMS algorithm can be expressed as 

$$w(n+1) = w(n) + u\,\mathrm{sign}[e(n)]\,x(n) \qquad (25)$$


where 

$$\mathrm{sign}[e(n)] = \begin{cases} 1, & e(n) \ge 0 \\ -1, & e(n) < 0 \end{cases}$$


The C program implementation of the sign-error LMS algorithm is 


    tu = u; 
    if (e[n] < 0.) { 
        tu = -u; } 

    for (i=0; i<N; i++) { 
        wn[i] += tu * xn[i]; 
    } 


As shown in Appendixes E1 and E2, the instruction sequence to implement weight 
update with the sign-error LMS algorithm is identical to that with the LMS algorithm. 
The difference is that the sign-error LMS algorithm uses the sign [e(n)]*u instead of e(n)*u 
before the update loop. Note that, for fixed-point processors, if u is chosen to be a power 
of two, the u x(n) can be accomplished by shifting right the elements in x(n). This algorithm 
keeps the same convergence direction as the LMS algorithm. Thus, the sign-error LMS 
algorithm should remain efficient, provided the variable gain u(n) is matched to this change. 
However, the use of constant step size u to reduce computation comes at the expense of 
a slow convergence rate since smaller u is normally used for stability reasons. 


The programs in Appendixes E1 and E2 implement a transversal filter with sign- 
error LMS algorithm in looped code. The total number of instruction cycles needed for 
this algorithm using the TMS320C25 is 7N+26, which is slightly less than for the LMS 
algorithm’s 7N +28. Computing u*e(n) takes 5 instruction cycles. The sign-error LMS 
algorithm determines the sign of the u by checking the sign of e(n), which takes only 3 
instruction cycles. The total number of instruction cycles needed for the sign-error LMS 
algorithm using the TMS320C30 is 3N+16, which is slightly higher than for the LMS 
algorithm. This occurs because the TMS320C30 takes only one instruction cycle to com- 
pute u*e(n) and two instruction cycles to determine the sign of the u. 


Secondly, the sign-data LMS algorithm is 

$$w(n+1) = w(n) + u\,e(n)\,\mathrm{sign}[x(n)] \qquad (26)$$

This equation can be implemented as 


$$w_i(n+1) = \begin{cases} w_i(n) + u\,e(n), & x(n-i) \ge 0 \\ w_i(n) - u\,e(n), & x(n-i) < 0 \end{cases}$$


for i = 0, 1, ..., N-1. Since a sign determination is required inside the adaptation loop 
to determine the sign of x(n-i), slower throughput is expected. The total number of instruction 
cycles needed is 11N+26 for the TMS320C25 and 5N+16 for the TMS320C30. 
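

The per-weight decision in the sign-data update can be illustrated with a short C sketch. The function name `sign_data_update` and the argument layout are assumptions of this example; it shows only the weight update of Equation (26), with the product u*e(n) formed once outside the loop. 

```c
#include <assert.h>
#include <stddef.h>

/* Sign-data LMS weight update: w_i += u*e(n) * sign[x(n-i)].
 * x[i] holds x(n-i); zero is treated as nonnegative. */
static void sign_data_update(double *w, const double *x, size_t n,
                             double e, double mu)
{
    assert(w != NULL && x != NULL);
    double ue = mu * e;                        /* one multiply per sample */
    for (size_t i = 0; i < n; i++)
        w[i] += (x[i] >= 0.0) ? ue : -ue;      /* decision inside the loop */
}
```

The conditional inside the loop is the sign determination that the text identifies as the cause of the slower throughput. 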


Finally, the sign-sign LMS algorithm is 
w(n+1) = w(n) + u sign[e(n)] sign[x(n)] (27) 


which requires no multiplications at all and is used in the CCITT standard for ADPCM 
transmission. As we can see from the above equations, the number of multiplications is 
reduced. This simplified LMS algorithm looks promising and is designed for VLSI or 
discrete IC implementation to save multiplications. 


The sign-sign LMS algorithm can be implemented as 


    for (i=0; i<N; i++) { 
        if (e[n] >= 0.) { 
            if (xn[i] >= 0.) 
                wn[i] += u; 
            else 
                wn[i] -= u; } 
        else { 
            if (xn[i] >= 0.) 
                wn[i] -= u; 
            else 
                wn[i] += u; } } 


When this algorithm is implemented on TMS320C25 and TMS320C30 with pipeline 
architecture and a parallel multiplier, the performance of sign-sign LMS algorithm is poor 
compared to standard LMS algorithm due to the determination of sign of data, which can 
break the instruction pipeline and can severely reduce the execution speed of the processors. 


In order to avoid double branches inside the loop, the XOR instruction is utilized 
to check the sign bit of e(n) and x(n—i). The sign-sign LMS algorithm can be implemented 
as 


$$w_i(n+1) = \begin{cases} w_i(n) + u, & \text{if } \mathrm{sign}[e(n)] = \mathrm{sign}[x(n-i)] \\ w_i(n) - u, & \text{otherwise} \end{cases}$$


The following TMS320C25 instruction sequence implements this algorithm without 
branching (assuming that the current address register used is AR3): 


       LRLK  AR1,N-1          ; Set up counter 
       LRLK  AR2,COEFFD       ; Point to wi(n) 
       LRLK  AR3,LASTAP+1     ; Point to x(n-i) 
ADAP   LAC   *-,0,AR2         ; Load x(n-i) 
       XOR   ERR              ; XOR with e(n) 
       SACL  ERRF             ; Save sign bit, sign = 0 if same signs 
                              ; sign = 1 if different signs 
       LAC   ERRF             ; Sign extension to ACCH, 
                              ; ACCH = 0 if ERRF >= 0 
                              ; ACCH = 0FFFFh if ERRF < 0 
       XORK  MU,15            ; Take one's complement of u 
                              ; if sign = 1 
       ADD   *,15             ; Weight update 
       SACH  *+,1,AR1         ; Save new weight 
       BANZ  ADAP,*-,AR3 


The one’s complement of u is used instead of —u, because they are only slightly 
different and the step size does not require the exact number. The weight update with 
this technique requires 10N instruction cycles and FIR filtering requires N instruction cycles 
so that the total number of instruction cycles needed is 11N+21. The complete TMS320C25 
assembly program is given in Appendix F1. 
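

The XOR technique can be mirrored in C on 16-bit integers, which may help in checking the logic before reading the assembly. This is a sketch under assumptions: the function name `sign_sign_update_q15` is hypothetical, the data are assumed to be Q15 words, and the addition is left unsaturated for brevity (an out-of-range result wraps in an implementation-defined way when converted back to 16 bits). 

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch-free sign-sign LMS update on 16-bit words.
 * Adds mu when e and x[i] have the same sign bit; otherwise adds the
 * one's complement of mu, which equals -mu - 1 (close to -mu). */
static void sign_sign_update_q15(int16_t *w, const int16_t *x, size_t n,
                                 int16_t e, int16_t mu)
{
    assert(mu > 0);
    for (size_t i = 0; i < n; i++) {
        /* Bit 15 of (x XOR e) is 1 exactly when the signs differ. */
        unsigned differ = (unsigned)((uint16_t)(x[i] ^ e) >> 15);
        /* Expand that bit into a mask: 0 or all ones (-1). */
        int16_t mask = (int16_t)-(int)differ;
        int16_t step = (int16_t)(mu ^ mask);   /* mu or ~mu */
        w[i] = (int16_t)(w[i] + step);         /* no saturation here */
    }
}
```

As in the assembly version, the negative step is off by one least-significant bit from -mu, which is acceptable because the step size does not need to be exact. 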


Determining whether a positive or negative u should be used without branching 
is trickier on the TMS320C30. Fortunately, the extended precision registers of the TMS320C30 
interpret the 32 most-significant bits of the 40-bit data as a floating-point number and 
the 32 least-significant bits of the 40-bit data as an integer. When a floating-point number 
changes its sign, its exponent remains the same. Therefore, the sign of the step size u can 
be determined by using XOR logic on its mantissa. The following code shows how the 
sign-sign LMS algorithm is implemented on the TMS320C30. 


        ASH    -31,R7               ; R7 = Sign[e(n)] 
        XOR3   R0,R7,R5             ; R5 = Sign[e(n)] * u 
        LDF    *AR0++(1)%,R6        ; R6 = x(n) 
        ASH    -31,R6               ; R6 = Sign[x(n-i)] 
        XOR3   R5,R6,R4             ; R4 = Sign[x(n-i)]*Sign[e(n)] * u 
        ADDF3  *AR1,R4,R3           ; R3 = wi(n) + R4 
        LDI    order-3,RC           ; Initialize repeat counter 
        RPTB   SSLMS                ; Do i = 0, N-3 
        LDF    *AR0++(1)%,R6        ; Get next data 
||      STF    R3,*AR1++(1)%        ; Update wi(n+1) 
        ASH    -31,R6               ; Get the sign of data 
        XOR3   R5,R6,R4             ; Decide the sign of u 
SSLMS   ADDF3  *AR1,R4,R3           ; R3 = wi(n) + R4 
        LDF    *AR0,R6              ; Get last data 
||      STF    R3,*AR1++(1)%        ; Update wN-2(n+1) 
        ASH    -31,R6               ; Get the sign of data 
        XOR3   R5,R6,R4             ; Decide the sign of u 
        ADDF3  *AR1,R4,R3           ; Compute wN-1(n+1) 
        STF    R3,*AR1++(1)%        ; Store last w(n+1) 


Here, R0, R4, and R5 contain the value of u before updating. AR0 and AR1 point 
to the x array and w array, respectively. R7 contains the value of the error signal e(n). The 
complete program is given in Appendix F2. The total number of instruction cycles is 5N+16, 
which is much higher than for the LMS algorithm. 


The sign-sign LMS algorithm is developed to reduce the multiplication requirement 
of the LMS algorithm. Since DSPs provide the hardware multiplier as a standard feature, 
this modification does not provide any advantage when implementing this algorithm on 
the DSPs. On the contrary, it causes some disadvantages since decision instructions will 
destroy the instruction pipeline. If you use the XOR logic operation in order to avoid us- 
ing the decision instructions, the complexity of the program will be increased and the total 
number of instruction cycles will be greater than the regular LMS algorithm. 


Leaky LMS Algorithm 


When adaptive filters are implemented on signal processors with fixed word lengths, 
roundoff noise is fed back to the adaptive weights and accumulates in time without bound. 
This leads to an overflow that is unacceptable for real-time applications. One solution is 
based upon adding a small forcing function, which tends to bias each filter weight toward 
zero. The leaky LMS algorithm has the form 


w(n+1) = r w(n) + u e(n) x(n) (28a) 


where r is slightly less than 1. 


Since r can be expressed as 1 - c with c << 1, the TMS320C25 can take advantage 
of the built-in shifters to implement this algorithm. Equation (28a) can be 
rewritten as 


$$w(n+1) = w(n) - c\,w(n) + u\,e(n)\,x(n) \qquad (28b)$$


In order to achieve the highest throughput by using ZALR and MPYA, c w(n) can 
be implemented by shifting w_i(n) right by m bits, where 2^{-m} is close to c. Since the length 
of the accumulator is 32 bits and the high word (bits 16 to 31) is used for updating w(n), 
shifting w_i(n) right by m bits can be implemented by loading w_i(n) and shifting it left by 
16 - m bits. The sequence of TMS320C25 instructions to implement Equation (28b) is 
shown as 


        LRLK  AR1,N-1          ; Set up counter 
        LRLK  AR2,COEFFD       ; Point to wi(n) 
        LRLK  AR3,LASTAP+1     ; Point to x(n-i) 
        LT    ERRF             ; T = ERRF = u*e(n) 
        MPY   *-,AR2 
ADAPT   ZALR  *,AR3 
        MPYA  *-,AR2 
        SUB   *,LEAKY          ; LEAKY = 16-m 
        SACH  *+,0,AR1 
        BANZ  ADAPT,*-,AR2 


For each iteration, 7N instruction cycles are needed to perform the adaptation process 
(6N for the LMS algorithm). The total number of instruction cycles needed is 8N+28 
(see Appendix G1 for the complete program). The leaky factor r has the same effect as 
adding white noise to the input. This technique not only solves the adaptive weight 
overflow problem, but can also be beneficial in insufficient spectral excitation and stalling 
situations [5]. 
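

A fixed-point C sketch of Equation (28b) may clarify how the leak term is obtained with a shift. It assumes Q15 weights and data and a precomputed product ue = u*e(n) in Q15; the function name `leaky_lms_update_q15` is hypothetical, a 64-bit accumulator is used to sidestep intermediate overflow, and the final right shift assumes the arithmetic behavior that common compilers provide for negative values. 

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* w_i(n+1) = w_i(n) - w_i(n)*2^-m + ue * x(n-i), all in Q15.
 * x[i] holds x(n-i); ue is u*e(n) in Q15; c = 2^-m is the leak. */
static void leaky_lms_update_q15(int16_t *w, const int16_t *x, size_t n,
                                 int16_t ue, unsigned m)
{
    assert(m >= 1 && m <= 15);
    for (size_t i = 0; i < n; i++) {
        /* Weight in the high word plus a rounding bit (like ZALR). */
        int64_t acc = (int64_t)w[i] * 65536 + 0x8000;
        acc += (int64_t)ue * x[i] * 2;              /* Q30 product to Q31 */
        acc -= (int64_t)w[i] * (1 << (16 - m));     /* leak: w * 2^-m */

        int64_t hi = acc >> 16;      /* assumes arithmetic right shift */
        if (hi > INT16_MAX) hi = INT16_MAX;         /* clamp to 16 bits */
        if (hi < INT16_MIN) hi = INT16_MIN;
        w[i] = (int16_t)hi;
    }
}
```

Loading the weight with a left shift of 16 - m and subtracting it from the accumulator is the same device as the SUB instruction with the LEAKY shift count in the listing above. 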


The method used above is specific to the TMS320C25, which has a free shift 
feature. Since the TMS320C30 is a floating-point processor, the filter coefficients can simply be 
multiplied by r. To reduce the instruction cycles, this multiplication can be 
combined with another instruction into a parallel instruction inside the loop. The following 
code shows how to rearrange the instructions from the LMS algorithm to include this 
multiplication without an extra instruction cycle. 


        MPYF   @u_r,R7              ; R7 = e(n)*u/r 
        MPYF3  *AR0++(1)%,R7,R1     ; R1 = e(n)*u*x(n)/r 
        MPYF3  *AR0++(1)%,R7,R1     ; R1 = e(n)*u*x(n-1)/r 
||      ADDF3  *AR1,R1,R2           ; R2 = w0(n) + e(n)*u*x(n)/r 
        LDI    order-4,RC           ; Initialize repeat counter 
        RPTB   LLMS                 ; Do i = 0, N-4 
        MPYF3  *AR2,R2,R0           ; R0 = r*wi(n) + e(n)*u*x(n-i) 
||      ADDF3  *+AR1(1),R1,R2       ; R2 = wi+1(n) + e(n)*u*x(n-i-1)/r 
LLMS    MPYF3  *AR0++(1)%,R7,R1     ; R1 = e(n)*u*x(n-i-2)/r 
||      STF    R0,*AR1++(1)%        ; Store wi(n+1) 

        MPYF3  *AR2,R2,R0           ; R0 = r*wN-3(n) + e(n)*u*x(n-N+3) 
||      ADDF3  *+AR1(1),R1,R2       ; R2 = wN-2(n) + e(n)*u*x(n-N+2)/r 
        MPYF3  *AR0,R7,R1           ; R1 = e(n)*u*x(n-N+1)/r 
||      STF    R0,*AR1++(1)%        ; Store wN-3(n+1) 
        MPYF3  *AR2,R2,R0           ; R0 = r*wN-2(n) + e(n)*u*x(n-N+2) 
||      ADDF3  *+AR1(1),R1,R2       ; R2 = wN-1(n) + e(n)*u*x(n-N+1)/r 
        MPYF3  *AR2,R2,R0           ; R0 = r*wN-1(n) + e(n)*u*x(n-N+1) 
||      STF    R0,*AR1++(1)%        ; Store wN-2(n+1) 
        STF    R0,*AR1++(1)%        ; Update last w 


Auxiliary registers AR0 and AR1 point to the x and w arrays. AR2 points to the memory 
location that contains the value r. R7 contains the value of the error signal e(n). R1 and R2 are 
updated before the loop because the parallel instructions inside the loop use the previous 
values in R1 and R2. Note that R1 is updated twice before the loop because the updating 
of R2 requires the previous value of R1. To update the x array pointer to the new 
beginning of the data buffer for the next iteration, two of the loop instruction sets have been 
taken out of the loop and modified by eliminating the increment of AR0. The TMS320C30 
assembly program of an adaptive transversal filter with the leaky LMS algorithm is listed 
in Appendix G2 as an example. The total number of instruction cycles for this algorithm 
is 3N+15, which is the same as for the LMS algorithm. This example shows the power and 
flexibility of the TMS320C30. 


Implementation Considerations 


The adaptive filter structures and algorithms discussed previously were derived on 
the basis of infinite precision arithmetic. When implementing these structures and algorithms 
on a fixed integer machine, there is a limitation on the accuracy of these filters due to 
the fact that the DSP operates with a finite number of bits. Thus, designers must pay at- 
tention to the effects of finite word length. In general, these effects are input quantization, 
roundoff in the arithmetic operation, dynamic range constraints, and quantization of filter 
coefficients. These effects can either cause deviations from the original design criteria 
or create an effective noise at the filter output. These problems have been investigated 
extensively, and techniques to solve these problems have been developed [28, 29]. 


The effect of finite precision in adaptive filters is an active research area, and some 
significant results have been reported [30 through 32]. There are three categories of finite 
word length effects in adaptive filters: 


• Dynamic Range Constraint (scaling to avoid overflow). Since this is not 
  applicable to a floating-point processor, the TMS320C30 is not mentioned 
  in this portion. 
• Finite Precision Errors (errors introduced by roundoff in the arithmetic). 
• Design Issues (design of the optimum step size u that minimizes system 
  noise). 


Dynamic Range Constraint 
As shown in Figure 1, the most widely used LMS transversal filter is specified by 
the difference equations 

$$y(n) = \sum_{i=0}^{N-1} w_i(n)\,x(n-i) \qquad (29)$$


and 

$$w_i(n+1) = w_i(n) + u\,e(n)\,x(n-i), \quad i = 0, 1, \ldots, N-1 \qquad (30)$$


where x(n-i) is the input sequence and w_i(n) are the filter coefficients. 


If the input sequence and filter coefficients are properly normalized so that their 
values lie between —1 and 1 using Q15 format, no error is introduced into the addition. 
However, the sum of two numbers may become larger than one. This is known as overflow. 
The TMS320C25 provides four features that can be applied to handle overflow manage- 
ment [13]: 


A. Branch on overflow conditions. 

B. Overflow mode (saturation arithmetic). 
C. Product register right shift. 

D. Accumulator right shift. 


One technique to reduce the probability of overflow is scaling, i.e., constraining 
each node within an adaptive filter to maintain a magnitude less than unity. In Equation 
(29), the condition for |y(n)| < 1 is 


$$x_{\max} < \frac{1}{\sum_{i=0}^{N-1} |w_i(n)|} \qquad (31)$$


where x_max denotes the maximum of the absolute value of the input. The right shifter 
of the TMS320C25, which operates with no cycle overhead, can be applied to implement 
scaling to prevent overflow of multiply-accumulate operations in Equation (29). By setting 
the PM bits of status register ST1 to 11 using the SPM or LST1 instructions, the 
P register output is right-shifted 6 places. This allows up to 128 accumulations without 
the possibility of an overflow. The SFR instruction can also be used to shift the accumulator 
right by one bit when it is near overflow. 
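

The bound in Equation (31) is simple to evaluate for a given weight vector, as the following C sketch suggests. The function name `input_bound` is an assumption of this example; it returns the largest input magnitude for which |y(n)| is guaranteed to stay below 1 in Equation (29). 

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on |x| from Equation (31): 1 / sum of |w_i|. */
static double input_bound(const double *w, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (w[i] < 0.0) ? -w[i] : w[i];   /* |w_i| without <math.h> */
    assert(sum > 0.0);                        /* all-zero weights: no bound */
    return 1.0 / sum;
}
```

For example, weights {0.5, -0.25, 0.25} have an absolute sum of 1, so any input with magnitude below 1 is safe, whereas weights {1, -1, 2} require the input to be scaled below 0.25. 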


Another effective technique to prevent overflow in the computation of Equation (29) 
is using saturation arithmetic. As illustrated in Figure 12, if the result of an addition 
overflows, the output is clamped at the maximum value. If saturation arithmetic is used, 
it is common practice [28] to permit the amplitude of x(n—i) to be larger than the upper 
bound given in Equation (31). Saturation of the filter represents a distortion, and the choice 
of scaling on the input depends on how often such distortion is permissible. The satura- 
tion arithmetic on the TMS320C25 is controlled by the OVM bit of status register STO 
and can be changed by the SOVM (set overflow mode), ROVM (reset overflow mode), 
or LST (load status register). 
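

The clamping behavior of saturation arithmetic can be sketched in C as follows. The helper names `sat_add32` and `mac_q30_sat` are hypothetical; the sketch models a 32-bit accumulator that clamps at its extreme values instead of wrapping, applied to a sum of Q15 products kept in Q30. 

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit addition that clamps at the extremes instead of wrapping. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;        /* exact sum in 64 bits */
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Multiply-accumulate of Q15 data with a saturating Q30 accumulator. */
static int32_t mac_q30_sat(const int16_t *w, const int16_t *x, size_t n)
{
    assert(w != NULL && x != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = sat_add32(acc, (int32_t)w[i] * x[i]);   /* Q15*Q15 = Q30 */
    return acc;
}
```

With wrap-around arithmetic, an overflowing sum would change sign; with saturation, the result stays at the nearest representable extreme, which is the distortion described above. 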


Figure 12. Saturation Arithmetic 


Filter coefficients are updated using Equation (30). As illustrated in Figure 13, a 
technique presented in reference [31] uses a scaling factor a to prevent overflow of the filter 
coefficients during the weight updating operation. Suppose you use a = 2^{-m}. A right 
shift by m bits implements multiplication by a, while a left shift by m bits implements 
the scaling factor 1/a. Usually, the required value of a is not expected to be very small 
and depends on the application. Since a scales the desired signal, it does not affect the 
rate of convergence. 


Figure 13. Fixed-Point Arithmetic Model of the Adaptive Filter 


Finite Precision Errors 


The TMS320C25 is a 16/32-bit fixed-point processor. Each data sample is represented 
by a fractional number that uses 15 magnitude bits and one sign bit. The quantization interval 

$$\delta = 2^{-b} \qquad (32)$$

(b = 15) is called the width of quantization, since the numbers are quantized in steps of δ. 


The products of the multiplications of data by coefficients within the filter must be 
rounded or truncated to be stored in memory or a CPU register. As shown in Figure 14, the 
roundoff error can be modeled as white noise injected into the filter by each rounding 
operation. This white noise has a uniform distribution over a quantization interval, and 
for rounding 


$$-\delta/2 < e \le \delta/2 \qquad (33a)$$


and 

$$\sigma_e^2 = \delta^2 / 12 \qquad (33b)$$

where σ_e^2 is the variance of the white noise. 


In general, roundoff noise occurs after each multiplication. However, the 
TMS320C25 has a full precision accumulator, i.e., a 16 x 16-bit multiplier with a 32-bit 
accumulator, so there is no roundoff when you implement a set of summations and 
multiplications as in Equation (29). Rounding is performed when the result is stored back 
to memory location y(n), so that only one noise source is present in a given summation 
node. 


y = Rounding[x · a] = x · a + e 


Figure 14. Fixed-Point Roundoff Noise Model 


For floating-point arithmetic, the variance of the roundoff noise [31] is slightly different 
from Equation (33b): 


$$\sigma_e^2 = 0.18\,\delta^2 \qquad (33c)$$


Since the TMS320C30 has a 40/32-bit floating-point multiplier and ALU, the result of an 
arithmetic operation has a mantissa of 31 bits plus one sign bit. Therefore, the δ in 
Equation (33c) is equal to 2^{-31}. Another roundoff noise is introduced when you store 
the result back to memory. This noise has the power of 2^{-23} because the mantissa of 
TMS320C30 floating-point data is 23 bits plus one sign bit. Therefore, unless the filter 
order is high, the roundoff noise from arithmetic operations is relatively small. 


The steady-state output error of the LMS algorithm due to the finite precision 
arithmetic of a digital processor was analyzed in reference [31]. It was found that the power 
of arithmetic errors is inversely proportional to the adaptation step size u. The significance 
of this result in the adaptive filter design is discussed next. Furthermore, roundoff noise 
is found to accumulate in time without bound, leading to an eventual overflow [32]. The 
leaky LMS algorithm presented in the previous section can be used to prevent the algorithm 
overflow. 


Design Issues 


The performance of digital adaptive algorithms differs from that of infinite precision adaptive 
algorithms. The finite precision LMS algorithm is given as 


$$w(n+1) = w(n) + Q[u\,e(n)\,x(n)] \qquad (34)$$


where Q[.] denotes the operation of fixed-point quantization. Whenever any correction 
term u*e(n)*x(n-i) in the update of the weight vector in Equation (34) is too small, the 
quantized value of that term is zero, and the corresponding weight w_i(n) remains unchanged. 
The condition for the ith component of the vector w(n) not to be updated when the 
algorithm is implemented with the TMS320C25 is 


$$|u\,e(n)\,x(n-i)| < \delta/2 \qquad (35a)$$

where δ = 2^{-15}. The condition for the TMS320C30 is 

$$|u\,e(n)\,x(n-i)| < 2^{exp}\,\delta/2 \qquad (35b)$$


where exp is the exponent of w_i(n) and δ = 2^{-23}. 
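

The fixed-point stalling condition of Equation (35a) can be checked numerically with a few lines of C. The function name `update_stalls_q15` is an assumption of this example; it takes the step size, error, and input sample as real numbers and reports whether the correction term would round to zero in a 16-bit Q15 weight. 

```c
#include <assert.h>

/* Returns 1 when |u*e*x| < delta/2 with delta = 2^-15 (Equation (35a)),
 * i.e., when the correction term rounds to zero and the weight stalls. */
static int update_stalls_q15(double mu, double e, double x)
{
    assert(mu > 0.0);
    const double delta = 1.0 / 32768.0;     /* 2^-15 */
    double term = mu * e * x;
    if (term < 0.0)
        term = -term;                       /* magnitude of the update */
    return term < delta / 2.0;
}
```

For instance, with u = 0.01 and x = 0.5, an error of 0.001 gives a correction of 5e-6, which is below delta/2 (about 1.5e-5), so the weight is not updated; an error of 0.1 gives 5e-4, and the update takes effect. 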


Since the adaptive algorithms are designed to minimize the mean squared value of 
the error signal, e(n) decreases with time. If u is small enough, most of the time the weights 
are not updated. This early termination of the adaptation may not allow the weight values 
to converge to the optimum set, resulting in a mean square error larger than its minimum 
value. The condition for the adaptation to converge completely [30] is u > u_min, where 


62 
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for the TMS320C25 and the TMS320C30 
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where $\sigma_x^2$ is the power of the input signal x(n) and $\varepsilon_{\min}$ is the minimum mean squared error
at steady state.
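
The early termination described above can be reproduced with a small fixed-point model. The sketch below is an illustration only: it models Q[.] as a rounded Q15 multiply, forms u*e(n) first and then multiplies by x(n-i), and omits saturation. The helper names are hypothetical, and the code assumes an arithmetic right shift for negative values, which holds on common compilers but is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Q15 x Q15 -> Q15 product with rounding (illustrative model of Q[.]).
 * The corner case a = b = -32768 is not handled. */
static int16_t q15_mul_round(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;          /* Q30 product */
    return (int16_t)((p + (1 << 14)) >> 15);      /* round and rescale to Q15 */
}

/* Quantized update of Equation (34) for a single tap:
 * w <- w + Q[ Q[u*e] * x ], all operands in Q15. Saturation is omitted. */
int16_t q15_lms_tap_update(int16_t w, int16_t u, int16_t e, int16_t x)
{
    int16_t ue = q15_mul_round(u, e);             /* u * e(n) */
    return (int16_t)(w + q15_mul_round(ue, x));   /* add correction term */
}
```

With u = 327 (0.01 in Q15), e = 33 (about 0.001), and x = 16384 (0.5), the product u*e already rounds to zero and the weight is left unchanged; with e = 16384 (0.5) the same tap moves by 82 LSBs.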


In the Leaky LMS Algorithm section, it was mentioned that the excess MSE given
in Equation (14) is minimized by using small u. However, this may result in a large
quantization error since the most significant term in the total output quantization error is [31]


$$\frac{N \sigma^2}{2 a^2 u}$$

The optimum step size $u_o$ reflects a compromise between these conflicting goals.
The value of $u_o$ is shown to be too small to allow the adaptive algorithm to converge
completely and also to give a slow convergence. In practice, $u > u_o$ is used for faster
convergence. Hence, the excess MSE becomes larger, and the roundoff noise can typically
be neglected when compared with the excess mean square error.


Finally, recall Equations (11) and (12). The step size u has an upper limit to guarantee 
the stability and convergence. Therefore, the adaptive algorithm requires 
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On the other hand, the step size u also has a lower limit. The optimum $u_o$, which
minimizes the sum of the excess MSE and roundoff noise, is smaller than $u_{\min}$, i.e., too
small to allow the adaptive weight to converge. For an algorithm implemented on the
TMS320C25, the word-length of 16 bits is fixed, and the minimum step-size that can be
used is given in Equation (36). The most important design issue is to find the best u to satisfy


$$u_{\min} < u < \frac{1}{N \sigma_x^2} \qquad (39)$$


Therefore, in order to satisfy the condition in Equation (39) on the floating-point
processor, where the update threshold in Equation (35b) scales with the exponent of each
weight, the initial values of the filter coefficients should be close to zero if the situation
is unknown.
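
A quick way to apply Equation (39) is to estimate the input power from recorded samples and test a candidate step size against both limits. The sketch below is illustrative: the function names are hypothetical, and the lower limit is passed in by the caller because it must be evaluated from Equation (36).

```c
#include <assert.h>
#include <stddef.h>

/* Average power of the input samples: (1/len) * sum of x[k]^2. */
double input_power(const double *x, size_t len)
{
    double acc = 0.0;
    assert(x != NULL && len > 0);
    for (size_t k = 0; k < len; k++) {
        acc += x[k] * x[k];
    }
    return acc / (double)len;
}

/* Upper limit of Equation (39): 1 / (N * sigma_x^2). */
double step_size_upper_limit(size_t order, double power)
{
    assert(order > 0 && power > 0.0);
    return 1.0 / ((double)order * power);
}

/* Returns 1 when u_min < u < 1/(N*sigma_x^2), else 0.
 * u_min is supplied by the caller (see Equation (36)). */
int step_size_in_range(double u, double u_min, size_t order, double power)
{
    return u > u_min && u < step_size_upper_limit(order, power);
}
```

For a filter of order 20 driven by a unit-power input, the upper limit is 0.05, so u = 0.01 is acceptable provided it also exceeds the lower limit.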


Software Development 


The TMS320C25 and TMS320C30 combine the high performance and the special 
features needed in adaptive signal processing applications. The processors are supported 
by a full set of software and hardware development tools. The software development tools 
include an assembler, a linker, a simulator, and a C compiler. The most universal
software development tool available is a macro assembler. However, assembly language
programming for a DSP can be tedious and costly. For adaptive filter applications, an
assembly language programmer must also have knowledge of adaptive signal processing. The
challenge lies in fitting a great deal of complex code into the small memory space and
tight execution time dictated by the real-time applications typical of adaptive signal
processing.


Recently, C compilers for the processors were developed to make DSP programming
easier, quicker, and less costly than programming in assembly language. Because of the
general characteristics of a compiler, the code it generates is not the most efficient.
Since program efficiency is important for adaptive filter implementation, the code
generated by the C compiler has to be modified before it is used. Thus, besides writing an
assembly program, two alternative ways to implement adaptive signal processing on a DSP
are presented. The first is the automatic adaptive filter code generator [12], which can be
found on the Texas Instruments TMS320 Bulletin Board Service (BBS); the second is the set
of adaptive filter function libraries that support the assembly and C programming languages.


In this report, two adaptive filter libraries have been developed: one can be called 
from an assembly main program; the other can be called from the C main program. Note 
that, for the TMS320C25 only, certain data memory locations have been reserved for storing 
the necessary filter coefficients, previous delayed signal, etc. In other words, these data 
memories are used as global variables. 


Assembly Function Libraries 


The basic concept of creating an assembly subroutine for an adaptive filter is to
restructure the assembly programs discussed above into modules. The user can then implement
the adaptive filter by writing an assembly main program that calls the subroutine.


TMS320C25 Assembly Subroutine 


The TMS320C25 has an eight-level deep hardware stack. The CALL and CALA 
subroutine calls store the current contents of the program counter (PC) on the top of the 
stack. The RET (return from subroutine) instruction pops the top of the stack back to the 
PC. For computational convenience, the processor needs to be set as follows before call- 
ing the assembly callable subroutine. 


1. PM status bits equal to 01. 
2. SXM status bit set to 1. 
3. The current DP (data memory page pointer) is 0. 


The following example is the TMS320C25 assembly main routine, which performs 
an adaptive line enhancement by calling the LMS algorithm subroutine. The filter order 
is 64, delay is equal to one, and the convergence factor u is 0.01. 


*
*       DEFINE AND REFER SYMBOLS
*
        .global ORDER,U,ONE,D,Y,ERR,XN,WN,LMS
*
*       DEFINE SAMPLING RATE, ORDER, AND MU
*
ORDER:  .equ    20
MU:     .equ    327             ; mu = 0.01 in Q15 format
PAGE0:  .equ    0
*
*       DEFINE ADDRESSES OF BUFFER AND COEFFICIENTS
*
X0:     .usect  "buffer",ORDER-1
XN:     .usect  "buffer",1
WN:     .usect  "coeffs",ORDER
*
*       RESERVE ADDRESSES FOR PARAMETERS
*
ONE:    .usect  "parameters",1
U:      .usect  "parameters",1
ERR:    .usect  "parameters",1
Y:      .usect  "parameters",1
D:      .usect  "parameters",1
ERRF:   .usect  "parameters",1
*
*       INITIALIZATION
*
START   LDPK    PAGE0           ; Set DP = 0
        SPM     1               ; Set PM equal to 1
        SSXM                    ; Set sign extension mode
        LRLK    AR7,X0          ; AR7 point to >300
        LACK    1               ; Initialize ONE = 1
        SACL    ONE
        LALK    MU              ; Initialize U = MU = 0.01
        SACL    U
*
*       PERFORM THE PREDICTOR
*
INPUT:  IN      D,PA2           ; Get the input
*
        CALL    LMS             ; Call subroutine
*
OUTPUT: OUT     Y,PA2           ; Output the signal
*
        LAC     D               ; Insert the newest sample
        LARP    AR7
        SACL    *
        B       INPUT
        .end




The symbols, such as ORDER, U, ONE, D, LMS, Y, and ERR, are defined and
referred to for the purpose of modular programming. The uninitialized sections specified
by the directive .usect can be placed in any location of memory according to the linker
command file. Note that the MACD instruction requires one operand in program memory
and the other in data memory, and the CNFP instruction configures RAM block
0 as program memory. Therefore, the coeffs section has to be in data RAM block 0, and
the buffer has to be in RAM block 1. Appendix H1 contains the adaptive transversal filter
with LMS algorithm subroutine using the TMS320C25, and Appendix H2 contains an 
example of a linker command file. 
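
The constant MU = 327 in the listing is the Q15 representation of 0.01. As a small sketch of that conversion, the helpers below (hypothetical names, not part of the TI tools) scale a fraction by 2^15. Truncation is assumed here because 0.01 * 32768 = 327.68, and the listing uses 327 rather than 328.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a fraction in [-1, 1) to a Q15 word by truncation toward zero. */
int16_t to_q15(double v)
{
    assert(v >= -1.0 && v < 1.0);
    return (int16_t)(v * 32768.0);
}

/* Convert a Q15 word back to a fraction. */
double from_q15(int16_t q)
{
    return (double)q / 32768.0;
}
```

The value actually used by the filter is therefore 327/32768, or about 0.00998, slightly below the nominal 0.01.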


TMS320C30 Assembly Subroutine 


Instead of a hardware stack, the TMS320C30 uses a software stack, which is more
flexible and convenient for a high-level language compiler. The stack memory location is
pointed to by the stack pointer SP. In order to maintain the proper program sequence,
the programmer must make certain that no data is lost and that the stack pointer always
points to the proper location. The PUSH, PUSHF, POP, POPF, CALL, CALLcond, RETIcond,
and RETScond instructions change the value of the stack pointer; in addition,
writing data into SP and taking an interrupt also change that value. It is the
programmer's responsibility to initialize the stack pointer at the beginning of the program.
The same adaptive line enhancer example as above, written for the TMS320C30, is listed
below. The adapfitr.int program that initializes the stack pointer and the data RAM is
given in Appendix H3.


*
*       DEFINE GLOBAL VARIABLES AND CONSTANTS
*
        .copy   "adapfitr.int"
        .global LMS30,order,u,d,y,e
N       .set    20
mu      .set    0.01
*
*       INITIALIZE POINTERS AND ARRAYS
*
        .text
begin   .set    $
        LDI     N,BK                ; Set up circular buffer
        LDP     @xn__addr           ; Set data page
        LDI     @xn__addr,AR0       ; Set pointer for x[]
        LDI     @wn__addr,AR1       ; Set pointer for w[]
        LDF     0.0,R0              ; R0 = 0.0
        RPTS    N-1
        STF     R0,*AR0++(1)%       ; x[] = 0
||      STF     R0,*AR1++(1)%       ; w[] = 0
        LDI     @in__addr,AR6       ; Set pointer for input ports
        LDI     @out__addr,AR7      ; Set pointer for output ports
*
*       PERFORM ADAPTIVE LINE ENHANCER
*
input:
        LDF     *AR6,R7             ; Input d(n)
||      LDF     *+AR6(1),R6         ; Input x(n)
        STF     R7,@d               ; Insert d(n)
        STF     R6,*AR0             ; Insert x(n) to buffer
*
*       CALL ASSEMBLY SUBROUTINE
*
        CALL    LMS30
*
*       OUTPUT y(n) AND e(n) SIGNALS
*
        LDF     @y,R6               ; Get y(n)
        BD      input               ; Delay branch
        LDF     @e,R7               ; Get e(n)
        STF     R6,*AR7             ; Send out y(n)
        STF     R7,*+AR7(1)         ; Send out e(n)
*
*       DEFINE CONSTANTS
*
xn        .usect  "buffer",N
wn        .usect  "coeffs",N
in__addr  .usect  "vars",1
out__addr .usect  "vars",1
xn__addr  .usect  "vars",1
wn__addr  .usect  "vars",1
u         .usect  "vars",1
order     .usect  "vars",1
d         .usect  "vars",1
y         .usect  "vars",1
e         .usect  "vars",1
cinit     .sect   ".cinit"
        .word   6,in__addr
        .word   0804000h
        .word   0804002h
        .word   xn
        .word   wn
        .float  mu
        .word   N-2
        .end


In the above example, the data memory location order is initialized to N-2 for
computational convenience. The linker command files and the subroutine that implements the
LMS transversal filter can be found in Appendixes H4 and H5.


C Function Libraries 


The TMS320C25 and TMS320C30 C language compilers provide high-level language 
support for these processors. The compilers allow application developers without an ex- 
tensive knowledge of the device’s architecture and instruction set to generate assembly 
code for the device. Also, since C programs are not device-specific, it is a relatively 
straightforward task to port existing C programs from other systems. 


To allow fast development of efficient programs for adaptive signal processing ap- 
plications, C function libraries have been developed. These libraries include functions for 
adaptive transversal, symmetric transversal, and lattice structures. 


TMS320C25 C-Callable Subroutines 


In a C program, the memory assignments are chosen by the compiler. There are 
two ways to use the most efficient instruction MACD: 


A. Use inline assembly code to assign memory locations for filter coefficients and 
buffers. 

B. Reserve the desired memory locations for them and do the assignment in the
linker command file.


The latter method is used in this report. 


For a C main program, the parameters passed to and returned from the subroutines 
are all within the parentheses following the subroutine name, as shown below: 


lms(n,mu,d,x,&y,&e)     n  - Filter order
                        mu - Convergence factor
                        d  - Desired signal
                        x  - Input signal
                        y  - Address of output signal
                        e  - Address of error signal


Since the TMS320C25 C compiler pushes the parameters from right to left onto the
software stack pointed to by AR1, the subroutine gets the parameters in reverse order, as
shown below:


        MAR     *-              ; Set pointer for getting parameters
        LAC     *-              ; ACC = N
        SUBK    1
        SACL    ORDER           ; ORDER = N - 1
        LAC     *-              ; Getting and storing the mu
        SACL    U
        LAC     *-              ; Getting and storing the D
        SACL    D
        LAC     *-,0,AR3        ; Insert the newest sample
        LRLK    AR3,FRSTAP
        SACL    *


The assembly subroutine returns the parameters y and e as follows: 


        LARP    AR1
        LAR     AR2,*-,AR2      ; Get the address of y in main
        LAC     Y
        SACL    *,0,AR1         ; Store y
        LAR     AR2,*,AR2       ; Get the address of e in main
        LAC     ERR
        SACL    *,0,AR1         ; Store e


Therefore, the parameters should be entered in the order given above. If there are 
other parameters, they should be inserted right after the convergence factor mu. The leaky 
LMS algorithm subroutine is given as an example. 


llms(n,mu,r,d,x,&y,&e) 


where r is defined in Equation (28a). Note that the values of the AR registers that will
be used in the subroutine, and the status registers, must be saved at the beginning of the
subroutine and restored right before returning to the calling routine. An example of a
C-callable program is given in Appendix I1. Memory locations 0200h to 0200h+N-1 and 0300h
to 0300h+N-1 are reserved for filter coefficients and buffers, respectively. N denotes
the filter order.


TMS320C30 C Subroutine 


As previously mentioned, the TMS320C30 architecture has features designed for 
a high-level language compiler. Note that the callable word is dropped in this section title 
because the TMS320C30 is so flexible that the restrictions for the TMS320C25 no longer 
exist. Since the memory locations of filter buffers and coefficients are determined by the 
parameters that pass from the calling routine, the same subroutine can be used in different 
places. However, the only restriction is that the memory locations of filter buffers must 
align to the circular addressing boundary [14]. The features of TMS320C30 architecture 
that make a major contribution toward these improvements are dual data address buses, 
software stack, and flexible addressing modes. The parameters passed to the subroutine are
pushed onto the stack. Therefore, after returning from the subroutine, the stack pointer,
SP, must be restored to the location where it pointed before the parameters were pushed
onto the stack. However, this is done by the C compiler. A usage example of the
C function subroutine is given as follows:


tlms(n,u,d,&w,&x,&y,&e) where n - Filter order 
u - Step size 
d - Desired signal 
&w - Filter coefficients 
&x - Input signal buffers 
&y - Addr of output signal 
&e - Addr of error signal 


The example below shows how the C subroutine receives and manipulates the 
parameters passed from the caller program and how the result is returned to the caller 
routine. 


*
*       SET FRAME POINTER FP
*
FP      .set    AR3
        PUSH    FP
        LDI     SP,FP
*
*       GET FILTER PARAMETERS
*
        LDI     *-FP(2),R4      ; Get filter order
        LDI     *-FP(6),AR0     ; Get pointer for x[]
        LDI     *-FP(5),AR1     ; Get pointer for w[]
*
*       COMPUTE ERROR SIGNAL e(n) AND STORE y(n) AND e(n)
*
        LDI     *-FP(2),AR2     ; Get y(n) address
        SUBF3   R2,*+FP(1),R7   ; e(n) = d(n) - y(n)
||      STF     R2,*AR2         ; Send out y(n)
        LDI     *-FP(3),AR2     ; Get e(n) address
        STF     R7,*AR2         ; Send out e(n)
        MPYF    *+FP(2),R7      ; R7 = e(n) * u
        POP     FP


Note that AR3 is used as the frame pointer in the TMS320C30 C compiler. Appendix
I2 contains the complete LMS transversal filter example subroutine program.
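
For readers who want to check results against the assembly fragment above on a host computer, one iteration of the transversal LMS filter can be modeled in portable C. This is a sketch only: `tlms_model` is a hypothetical name, the argument order merely mirrors the tlms call shown earlier, and a shifted delay line is assumed in place of the circular addressing used on the TMS320C30.

```c
#include <assert.h>
#include <stddef.h>

/* Portable model of one transversal LMS iteration (not the TI routine).
 * x[0] holds the newest input sample and x[n-1] the oldest; the caller
 * stores the new sample in x[0] before each call. */
void tlms_model(size_t n, float u, float d,
                float *w, float *x, float *y, float *e)
{
    float acc = 0.0f;
    assert(n > 0 && w && x && y && e);

    for (size_t i = 0; i < n; i++) {      /* y(n) = sum of w_i * x(n-i) */
        acc += w[i] * x[i];
    }
    *y = acc;
    *e = d - acc;                         /* e(n) = d(n) - y(n) */

    float ue = u * (*e);                  /* common factor u * e(n) */
    for (size_t i = 0; i < n; i++) {      /* w_i <- w_i + u*e(n)*x(n-i) */
        w[i] += ue * x[i];
    }

    for (size_t i = n - 1; i > 0; i--) {  /* age the delay line by one */
        x[i] = x[i - 1];
    }
}
```

Shifting the delay line costs N moves per sample, which is why the processor version relies on circular addressing; the numerical result of each iteration is intended to be the same, apart from differences in floating-point format.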


Development Process and Environment 


Adaptive structures and algorithms are implemented on the TMS320C25 following a
four-stage procedure [33] that minimizes the amount of finite word length effect analysis
and real-time debugging. Figure 15 illustrates the flowchart of this procedure. Since the
implementation on the TMS320C30 is done only on the simulator, the last stage, real-time
testing, is not carried out for it.


[Flowchart: Algorithm Analysis and C Program Implementation -> Re-write C Program to
Emulate DSP Sequence -> Implement in DSP Program and Testing by DSP Simulator ->
Real-Time Testing]


Figure 15. Adaptive Filter Implementation Procedure 


In the first stage, algorithm design and study is performed on a personal computer. 
Once the algorithm is understood, the filter is implemented using a high-level C program 
with double precision coefficients and arithmetic. This filter is considered an ideal filter. 
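
A stage-one model can be quite small. The sketch below shows a double-precision one-step adaptive predictor of the kind used later in the real-time test; the structure name, function name, and the order of 4 are assumptions chosen only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define PRED_ORDER 4   /* hypothetical order, chosen for the sketch */

typedef struct {
    double w[PRED_ORDER];   /* adaptive weights */
    double x[PRED_ORDER];   /* delay line: x[0] = s(n-1), x[1] = s(n-2), ... */
} predictor_t;

/* Process the current sample s(n). Returns the prediction y(n) formed from
 * past samples and writes the prediction error e(n) = s(n) - y(n). */
double predictor_step(predictor_t *p, double u, double s, double *e)
{
    double y = 0.0;
    assert(p != NULL && e != NULL);

    for (size_t i = 0; i < PRED_ORDER; i++) {
        y += p->w[i] * p->x[i];                 /* filter past samples */
    }
    *e = s - y;
    for (size_t i = 0; i < PRED_ORDER; i++) {
        p->w[i] += u * (*e) * p->x[i];          /* LMS weight update */
    }
    for (size_t i = PRED_ORDER - 1; i > 0; i--) {
        p->x[i] = p->x[i - 1];                  /* shift the delay line */
    }
    p->x[0] = s;                                /* s(n) is s(n-1) next time */
    return y;
}
```

Initializing the structure to zero and feeding it the recorded input one sample at a time yields the reference output against which the later stages can be compared.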


In the second stage, the C program is rewritten in a way that emulates the same 
sequence of operations with the same parameters and state variables that will be implemented 
in the processors. This program then serves as a detailed outline for the DSP assembly 
language program or can be compiled using TMS320C25 or TMS320C30 C compiler. 
The effects of numerical errors can be measured directly by means of the technique shown 
in Figure 16, where H(z) is the ideal filter implemented in the first stage and H’(z) is 
a real filter. Optimization is performed to minimize the quantization error and produce 
stable implementation. 
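
One simple way to put a number on the numerical errors, assuming the technique amounts to driving both filters with the same input and differencing their outputs, is to compute the mean squared difference between the two output sequences. The function name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared difference between the output of the ideal filter H(z)
 * and the output of the finite-precision filter H'(z). */
double quantization_error_power(const double *ideal, const double *real,
                                size_t len)
{
    double acc = 0.0;
    assert(ideal != NULL && real != NULL && len > 0);
    for (size_t k = 0; k < len; k++) {
        double diff = ideal[k] - real[k];
        acc += diff * diff;
    }
    return acc / (double)len;
}
```

Tracking this value while varying the step size or the scaling gives a direct view of how each choice affects the quantization error.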


Figure 16. A Commutational Technique for Evaluating Quantization Effects 


In the third stage, the TMS320C25 and TMS320C30 assembly programs are 
developed; then they are tested using the simulators with test data from a disk file. Note 
that the simulation of TMS320C25 can also be implemented on the SWDS with the data 
logging option. This test data is a short version of the data used in stage 2 that can be 
internally generated from a program or data digitized from a real application environ- 
ment. Output from the simulation is compared against the equivalent output of the C pro- 
gram in the second stage. Since the simulation requires data files to be in Q15 format, 
certain precision is lost during data conversion. When a one-to-one agreement within 
tolerable range is obtained between these two outputs, the processor software is assured 
to be essentially correct. 
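
The comparison between the simulator output and the stage-two C output can be automated. The following sketch assumes the simulator output is available as Q15 words and the C reference as fractions, and expresses the tolerable range in least significant bits; the function name and the choice of tolerance are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when every simulator sample (Q15 word) lies within tol_lsb
 * least significant bits of the corresponding C reference sample, else 0. */
int outputs_agree(const double *ref, const int16_t *sim, size_t len,
                  double tol_lsb)
{
    assert(ref != NULL && sim != NULL && len > 0 && tol_lsb >= 0.0);
    for (size_t k = 0; k < len; k++) {
        double diff = ref[k] * 32768.0 - (double)sim[k];  /* error in LSBs */
        if (diff < 0.0) {
            diff = -diff;
        }
        if (diff > tol_lsb) {
            return 0;
        }
    }
    return 1;
}
```

A tolerance of one or two LSBs is a plausible starting point, given the precision lost when the test data is converted to Q15.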


The final stage is applied only to the TMS320C25. The assembled program is first
downloaded into the target TMS320C25 system (SWDS) to initiate real-time operation. Because
the earlier stages have already verified the arithmetic, the real-time debugging process is
constrained primarily to debugging the I/O timing structure of the algorithm and testing
the long-term stability of the algorithm. Figure 17 shows an experimental setup for
verification, in which the adaptive filter is configured as the one-step adaptive predictor
illustrated in Figure 18. The data used for real-time testing is a sinusoid generated by a
Tektronix FG504 Function Generator embedded in white noise generated by an HP Precision
Noise Generator. The DSP gets a quantized signal from the Analog Interface Board (AIB),
performs adaptive prediction routines, and outputs an enhanced sinusoid to the analog
interface board. The corrupted input and predicted (enhanced) output waveforms are compared
on the oscilloscope or on the HP 3561A Dynamic Signal Analyzer. The corresponding spectra
of input and output can be compared on the signal analyzer. The signal-to-noise ratio (SNR)
improvement can be measured from the analyzer, which is connected to an HP plotter.


[Block diagram: personal computer, DSP development system (SWDS and AIB), TEK 2235 scope,
FG504 function generator, precision noise generator, HP3561A dynamic signal analyzer,
HP plotter]


Figure 17. Real-Time Experiment Setup 


[Block diagram: adaptive filter driven by the delayed input x(n-1), with error signal
e(n) and enhanced output]


Figure 18. Block Diagram of a One-Step Adaptive Predictor 


To illustrate the operation in a nonstationary environment, the adaptive predictor 
is implemented using a TMS320C25, and the following experiment is performed. The 
input signal is swept from 1287 Hz to 4025 Hz, then jumps back to 1287 Hz. The time 
for each sweep is one second. The input spectra at every second are shown in Figure 19a; 
the corresponding output spectra are shown in Figure 19b. Observations on the
oscilloscope and signal analyzer confirm the significant SNR improvement, convergence speed,
ability to track nonstationary signals, and long-term stability of the adaptive predictor.
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Figure 19(a). Spectrum of Input Signal 
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Figure 19(b). Spectrum of Enhanced Output Signal 
Summary 


Three adaptive structures and six update algorithms are implemented with the 
TMS320C25 and TMS320C30. Applications of adaptive filters and implementation con- 
siderations have been discussed. Two subroutine libraries that support both C language 
and assembly language for two processors were developed. These routines can be readily 
incorporated into TMS320C25 or TMS320C30 users’ application programs. 


The advancements in the TMS320C25 and TMS320C30 devices have made the im- 
plementation of sophisticated adaptive algorithms oriented toward performing real-time 
processing tasks feasible. Many adaptive signal processing algorithms are readily available 
and capable of solving real-time problems when implemented on the DSP. These pro- 
grams provide an efficient way to implement the widely used structures and algorithms 
on the TMS320C25 and TMS320C30, based on assembly-language programming. They 
are also extremely useful for choosing an algorithm for a given application. The perfor- 
mances of adaptive structures and algorithms that have been implemented using the 
TMS320C25 and TMS320C30 have been summarized in Tables 1 and 2. 


Table 1. The Performance of Adaptive Structures and Algorithms of TMS320C25 
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Note: N represents filter order. 


Table 2. The Performance of Adaptive Structures and Algorithms of TMS320C30 
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Appendix H1. Assembly Subroutine of Transversal Structure with 
LMS Algorithm Using the TMS320C25 
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Appendix H4. Assembly Subroutine of Transversal Structure with 
LMS Algorithm Using the TMS320C30 
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